A machine readable version of this document and associated data can be found at github.com/mg14/AML-multistage.
We use the data from N=1,540 AML cases as described in our companion paper (Papaemmanuil et al. 2015).
These can be summarised as follows:
| Group | Symbol | p | Variables |
|---|---|---|---|
| Fusion genes | Fusions | 9 | t_MLL, inv3_t3_3, t_9_22, t_15_17, t_8_21, inv16_t16_16, t_6_9, t_9_11, t_v_11 |
| Copy number alterations | CNA | 18 | minus5_5q, minus7, minus7q, abn7other, plus8_8q, minus9q, mono12_12p_abn12p, plus13, mono17_17p_abn17p, minus18_18q, minus20_20q, plus21, plus22, minusY, abn3q_other, plus11_11q, mono4_4q_abn4q, complex |
| Point mutations in single genes | Genetics | 58 | ASXL1, ATRX, BCOR, BRAF, CBL, CBLB, CDKN2A, CREBBP, CUX1, DNMT3A, EP300, ETV6, EZH2, FBXW7, GATA2, GNAS, IDH1, IKZF1, JAK2, KDM5A, KDM6A, KIT, KRAS, MLL, MLL2, MLL3, MLL5, MPL, MYC, NF1, NPM1, NRAS, PHF6, PRPF40B, PTEN, PTPN11, RAD21, RB1, RUNX1, SF1, SF3A1, SF3B1, SFRS2, SH2B3, STAG2, TET2, TP53, U2AF1, U2AF2, WT1, ZRSR2, CEBPA_mono, CEBPA_bi, FLT3_ITD, FLT3_TKD, FLT3_other, IDH2_p172, IDH2_p140 |
| Gene:Gene product terms | Gene:Gene | 126 | BCOR:DNMT3A, ASXL1:EZH2, DNMT3A:IDH1, DNMT3A:KRAS, DNMT3A:MLL, IDH1:MLL, DNMT3A:MYC, DNMT3A:NF1, CBL:NPM1, DNMT3A:NPM1, GATA2:NPM1, IDH1:NPM1, KIT:NPM1, KRAS:NPM1, MYC:NPM1, NF1:NPM1, ASXL1:NRAS, BCOR:NRAS, DNMT3A:NRAS, EZH2:NRAS, GATA2:NRAS, IDH1:NRAS, KIT:NRAS, KRAS:NRAS, MLL:NRAS, NPM1:NRAS, NPM1:PHF6, DNMT3A:PTPN11, IDH1:PTPN11, KRAS:PTPN11, NPM1:PTPN11, NRAS:PTPN11, DNMT3A:RAD21, NPM1:RAD21, NRAS:RAD21, PTPN11:RAD21, ASXL1:RUNX1, BCOR:RUNX1, DNMT3A:RUNX1, EZH2:RUNX1, IDH1:RUNX1, MLL:RUNX1, NRAS:RUNX1, PHF6:RUNX1, NRAS:SF3B1, ASXL1:SFRS2, DNMT3A:SFRS2, IDH1:SFRS2, NPM1:SFRS2, NRAS:SFRS2, RUNX1:SFRS2, ASXL1:STAG2, DNMT3A:STAG2, EZH2:STAG2, MLL:STAG2, NPM1:STAG2, NRAS:STAG2, RUNX1:STAG2, SFRS2:STAG2, ASXL1:TET2, DNMT3A:TET2, KIT:TET2, MLL:TET2, NPM1:TET2, NRAS:TET2, PTPN11:TET2, RUNX1:TET2, SFRS2:TET2, STAG2:TET2, DNMT3A:TP53, NRAS:TP53, NRAS:U2AF1, NPM1:WT1, NRAS:WT1, DNMT3A:CEBPA_mono, NPM1:CEBPA_mono, TET2:CEBPA_mono, GATA2:CEBPA_bi, NRAS:CEBPA_bi, WT1:CEBPA_bi, DNMT3A:FLT3_ITD, EZH2:FLT3_ITD, IDH1:FLT3_ITD, MLL:FLT3_ITD, MYC:FLT3_ITD, NPM1:FLT3_ITD, NRAS:FLT3_ITD, PHF6:FLT3_ITD, PTPN11:FLT3_ITD, RAD21:FLT3_ITD, RUNX1:FLT3_ITD, STAG2:FLT3_ITD, TET2:FLT3_ITD, WT1:FLT3_ITD, CEBPA_mono:FLT3_ITD, CEBPA_bi:FLT3_ITD, DNMT3A:FLT3_TKD, IDH1:FLT3_TKD, MLL:FLT3_TKD, NPM1:FLT3_TKD, NRAS:FLT3_TKD, RAD21:FLT3_TKD, RUNX1:FLT3_TKD, TET2:FLT3_TKD, WT1:FLT3_TKD, FLT3_ITD:FLT3_TKD, DNMT3A:FLT3_other, NPM1:FLT3_other, NRAS:FLT3_other, PTPN11:FLT3_other, RUNX1:FLT3_other, TET2:FLT3_other, FLT3_ITD:FLT3_other, DNMT3A:IDH2_p172, ASXL1:IDH2_p140, DNMT3A:IDH2_p140, MLL:IDH2_p140, NPM1:IDH2_p140, NRAS:IDH2_p140, PTPN11:IDH2_p140, RUNX1:IDH2_p140, SFRS2:IDH2_p140, STAG2:IDH2_p140, FLT3_ITD:IDH2_p140, FLT3_TKD:IDH2_p140, NPM1:FLT3_ITD:DNMT3A |
| Clinical parameters | Clinical | 11 | Performance_ECOG, BM_Blasts_100, PB_Blasts_100, wbc_100, LDH_1000, HB_10, platelet_100, Splenomegaly, oAML, sAML, tAML |
| Demographic variables | Demographics | 2 | AOD_10, gender |
| Treatment | Treatment | 4 | ATRA, VPA, HSCT in CR1, HSCT after relapse |
| Nuisance terms | Nuisance | 4 | AMLHD98A, AMLHD98B, Date_1000, MissingCyto |
Note that the Point mutations in single genes group also includes FLT3 internal tandem duplications (ITD), as well as MLL partial tandem duplications (PTD), which are large indels.
The following preprocessing steps were applied:
Fusion genes, Copy number alterations, Genetics and Gene:Gene interactions were encoded as 0 (absent) and 1 (present) based on the same annotation as in (Papaemmanuil et al. 2015). A gene was considered mutated and encoded as 1 if it contained at least one oncogenic mutation, and 0 otherwise. In addition, we used the following rules:
Gene:Gene product terms were computed, indicating whether a combination of two mutated genes was present. This allows us to account for non-additive genetic interactions. To limit the number of variables, product terms were only included if the combination occurred at least 8 times.
Quantitative clinical variables were rescaled to a magnitude of 1 as described above. To assess the validity of our log-linear risk model we computed spline fits, which allow for a non-linear dependence between the log hazard and each variable. We did not observe a measurable improvement of our model fits in cross-validation.
The following variables were included in the model:
In addition to the aforementioned explanatory variables we used the following multiplicative strata to account for potential confounding factors:
Load a few libraries; see the end of the document for a full list of libraries and their versions.
library(CoxHD)
library(mg14)
library(RColorBrewer) ## provides brewer.pal()
library(survival)     ## Surv(), coxph(), survfit()
library(parallel)     ## mclapply()
set1 <- brewer.pal(9, "Set1")
We use the following steps for processing of the original data. Note that, for privacy reasons, we cannot distribute clinical data with the actual event dates and instead provide these data in an anonymised form.
clinicalData <- read.table("../data/AMLSG_Clinical.txt", sep="\t", header=TRUE, na.strings = "na", comment.char = "", quote="\"")
clinicalData <- clinicalData[order(clinicalData$PDID),]
clinicalData$ERDate <- as.Date(as.character(clinicalData$ERDate), "%d-%b-%y")
clinicalData$CR_date <- as.Date(as.character(clinicalData$CR_date), "%d-%b-%y")
clinicalData$TPL_date <- as.Date(as.character(clinicalData$TPL_date), "%d-%b-%y")
clinicalData$Date_LF <- as.Date(as.character(clinicalData$Date_LF), "%d-%b-%y")
clinicalData$Recurrence_date <- as.Date(as.character(clinicalData$Recurrence_date), "%d-%b-%y")
levels(clinicalData$Study) <- c(`_07-04`="AMLSG0704" , `98A`="AMLHD98A" , `98B`="AMLHD98B")[levels(clinicalData$Study)]
clinicalData$Study <- factor(as.character(clinicalData$Study))
clinicalData$VPA[is.na(clinicalData$VPA)] <- 0
clinicalData$ATRA_arm[is.na(clinicalData$ATRA_arm)] <- 0
colnames(clinicalData) <- gsub('\\.',"",colnames(clinicalData))
clinicalData <- clinicalData[!is.na(clinicalData$TypeAML),] ## remove unknown patients
clinicalData$PDID <- factor(as.character(clinicalData$PDID))
t <- read.table("../data/AMLSG_Karyotypes.txt", header=T, sep="\t", na.strings = "na",comment.char = "", quote="\"")
karyotypes <- t$karyotype[match(clinicalData$PDID,t$PDID)]
rm(t)
clinicalData$t_9_11 <- grepl("t\\(9;11\\)\\(p22;q23\\)", karyotypes) + 0 # t(9;11)
clinicalData$t_v_11 <- clinicalData$t_MLL &! clinicalData$t_9_11
clinicalData$t_MLL <- NULL
dim(clinicalData)
## [1] 1540 106
Here we store the data in an anonymised form, where all references to the actual date of diagnosis are removed. These data are available on github.
e <- clinicalData$ERDate
clinicalData$ERDate <- clinicalData$ERDate - e
clinicalData$CR_date <- clinicalData$CR_date - e
clinicalData$Date_LF <- clinicalData$Date_LF - e
clinicalData$TPL_date <- clinicalData$TPL_date - e
clinicalData$Recurrence_date <- clinicalData$Recurrence_date - e
save(clinicalData, file="../data/AMLSG_Clinical_Anon.RData")
Load the data using
load("../data/AMLSG_Clinical_Anon.RData")
mutationData = read.table("../data/AMLSG_Genetic.txt", sep="\t", header=TRUE, strip.white = TRUE)
mutationData$SAMPLE_NAME <- factor(as.character(mutationData$SAMPLE_NAME), levels = levels(clinicalData$PDID)) ## Refactor
mutationTable <- (table(mutationData[mutationData$Result %in% c("ONCOGENIC","POSSIBLE") & mutationData$FINAL_CALL == "OK" ,c("SAMPLE_NAME","GENE")]) > 0)+0
dim(mutationTable)
## [1] 1540 54
all(rownames(mutationTable)==clinicalData$PDID)
## [1] TRUE
os <- Surv(clinicalData$OS, clinicalData$Status) #OS
t <- clinicalData$Time_Diag_TPL
t[is.na(t) | !clinicalData$TPL_Phase %in% "CR1" | !clinicalData$TPL_type %in% c("ALLO","FREMD") ] <- Inf ## Only allografts in CR1
o <- clinicalData$OS
tplIndexOs <- t < o
osTD <- Surv(time = rep(0, nrow(clinicalData)), time2=pmin(o, t), event=ifelse(tplIndexOs, 0, clinicalData$Status) )
osTD <- rbind(osTD,
Surv(time=t[which(tplIndexOs)],
time2=o[which(tplIndexOs)],
event=clinicalData$Status[which(tplIndexOs)])
)
osTD = Surv(osTD[,1],osTD[,2],osTD[,3])
rm(o,t)
tplSplitOs <- c(1:nrow(clinicalData), which(tplIndexOs))
osYr <- os
osYr[,1] <- osYr[,1]/365
osYrTD <- osTD
osYrTD[,1] <- osYrTD[,1]/365
All data as list
dataList <-list(Genetics = data.frame(mutationTable[,colSums(mutationTable)>0]),
Cytogenetics = clinicalData[,grep("^(t_)|(inv)|(abn)|(plus)|(minus)|(mono)|(complex)",colnames(clinicalData))],
Nuisance = data.frame( MakeInteger(clinicalData$Study)[,1:2], Date=scale(as.numeric(clinicalData$ERDate), scale=FALSE), MissingCyto=is.na(clinicalData$t_15_17)+0),
Treatment = data.frame(ATRA = clinicalData$ATRA_arm, VPA=clinicalData$VPA, TPL_os=tplIndexOs),
Demographics = clinicalData[,c("AOD","gender")],
Clinical = cbind(clinicalData[, c("Performance_ECOG","BM_Blasts","PB_Blasts","wbc","LDH","HB","platelet","Splenomegaly")], MakeInteger(clinicalData$TypeAML)[,-1]))#,
#MolRisk = makeInteger(clinicalData$M_Risk))
#dataList$Genetics$CEBPA <- clinicalData$CEBPA # encoded as 0,1,2
dataList$Genetics$CEBPA_mono <- clinicalData$CEBPA == 1 # encoded as 0,1,2
dataList$Genetics$CEBPA_bi <- clinicalData$CEBPA == 2 # encoded as 0,1,2
dataList$Genetics$CEBPA <- NULL
dataList$Genetics$FLT3 <- NULL
dataList$Genetics$FLT3_ITD <- clinicalData$FLT3_ITD != "0"
dataList$Genetics$FLT3_TKD <- clinicalData$FLT3_TKD != "0"
dataList$Genetics$FLT3_other <- clinicalData$FLT3_other != "0"
dataList$Genetics$IDH2_p172 <- table(mutationData$SAMPLE_NAME[mutationData$GENE=='IDH2' & grepl("172", mutationData$AA_CHANGE)])[]
dataList$Genetics$IDH2_p140 <- table(mutationData$SAMPLE_NAME[mutationData$GENE=='IDH2' & grepl("140", mutationData$AA_CHANGE)])[]
dataList$Genetics$IDH2 <- NULL
dataList$Genetics$NPM1 <- clinicalData$NPM1
dataList$Cytogenetics$MLL_PTD <- NULL
dataList$Genetics = dataList$Genetics + 0
dataList$GeneGene <- MakeInteractions(data.frame(dataList$Genetics), data.frame(dataList$Genetics))[,as.vector(upper.tri(matrix(0,ncol=ncol(dataList$Genetics), nrow=ncol(dataList$Genetics))))]
dataList$GeneGene <- dataList$GeneGene[,colSums(dataList$GeneGene, na.rm=TRUE)>0]
dataList$GeneGene$`NPM1:FLT3_ITD:DNMT3A` <- (rowSums(dataList$Genetics[c('NPM1',"FLT3_ITD","DNMT3A")])==3)+0 ## Add NPM1:FLT3_ITD:DNMT3A product term as well
dataList$CytoCyto <- MakeInteractions(dataList$Cytogenetics, dataList$Cytogenetics)[,sapply(1:ncol(dataList$Cytogenetics), `<`, 1:ncol(dataList$Cytogenetics))]
dataList$CytoCyto <- dataList$CytoCyto[, colSums(dataList$CytoCyto, na.rm=TRUE) > 0]
dataList$GeneCyto <- MakeInteractions(dataList$Genetics, dataList$Cytogenetics)
dataList$GeneCyto <- dataList$GeneCyto[,colSums(dataList$GeneCyto, na.rm=TRUE) > 0]
dataList$GeneTreat <- MakeInteractions(dataList$Genetics, dataList$Treatment)
dataList$GeneTreat <- dataList$GeneTreat[,colSums(dataList$GeneTreat, na.rm=TRUE) > 0]
dataList$CytoTreat <- MakeInteractions(dataList$Cytogenetics, dataList$Treatment)
dataList$CytoTreat <- dataList$CytoTreat[,colSums(dataList$CytoTreat, na.rm=TRUE) > 0]
Condensing to a data.frame
dataRaw <- do.call(cbind,dataList)
names(dataRaw) <- unlist(sapply(dataList, names))
dataFrame <- StandardizeMagnitude(dataRaw)
dim(dataFrame)
## [1] 1540 1827
groups <- unlist(sapply(names(dataList), function(x) rep(x, ncol(dataList[[x]]))))
groups[grepl("^(t_)|(inv)", colnames(dataFrame)) &! grepl(":", colnames(dataFrame))] <- "Fusions"
groups[groups=="Cytogenetics"] <- "CNA"
groups <- factor(groups)
names(groups) <- colnames(dataFrame)
table(groups)
## groups
## CNA Clinical CytoCyto CytoTreat Demographics Fusions GeneCyto GeneGene GeneTreat
## 18 11 183 70 2 8 606 721 143
## Genetics Nuisance Treatment
## 58 4 3
Poor man’s imputation by column means
poorMansImpute <- function(x) {x[is.na(x)] <- mean(x, na.rm=TRUE); return(x)}
dataFrame <- as.data.frame(sapply(dataFrame, poorMansImpute))
rownames(dataFrame) <- clinicalData$PDID
copyNumbers = cbind(dataList$Cytogenetics[grep(c("minus|plus|mono"), colnames(dataList$Cytogenetics))], clinicalData$gender)
copyNumbers$minus7 <- (copyNumbers$minus7 | copyNumbers$minus7q) +0
copyNumbers$minus7q <- NULL
for(i in 1:ncol(copyNumbers)){
if(grepl("plus", colnames(copyNumbers)[i]))
copyNumbers[,i] = copyNumbers[,i] * 3
}
copyNumbers[copyNumbers==0 | is.na(copyNumbers)] = 2
colnames(copyNumbers) = c(5,7,8,9,12,13,17,18,20,21,22,"Y",11,4,"X")
rownames(copyNumbers) <- clinicalData$PDID
copyNumbers$Y <- c(1:0)[clinicalData$gender] - mg14::na.zero(dataList$Cytogenetics$minusY)
cn = sapply(1:nrow(mutationData), function(i) {c=copyNumbers[mutationData$SAMPLE_NAME[i],match(mutationData$CHR[i], colnames(copyNumbers))]; if(length(c)==0) 2 else c})
vaf <- as.numeric(as.character(mutationData$X._MUT_IN_TUM))
## Warning: NAs introduced by coercion
depth <- as.numeric(as.character(mutationData$TUM_DEPTH))
## Warning: NAs introduced by coercion
dataFLT3_ITD <- read.table("../data/AMLSG_FLT3ITD.txt", sep="\t", header=TRUE)
dataFLT3_ITD$Sample <- sub("WGA_","", dataFLT3_ITD$Sample)
mcf <- vaf/100*cn ## Approx mutant cell fraction, assuming mutations on only one copy
mcf[which(mcf > 1.25)] <- vaf[which(mcf > 1.25)] ## Probably over adjusted
mcf[mcf > 1] <- 1 ## Random fluctuations
genesClonal <- dataFrame[groups=="Genetics"]
precedence <- matrix(0, nrow=ncol(genesClonal), ncol = ncol(genesClonal) , dimnames=list(colnames(genesClonal), colnames(genesClonal)))
plist <- list()
lesions <- as.character(mutationData$GENE)
lesions[mutationData$GENE=='IDH2' & grepl("172", mutationData$AA_CHANGE)] <- "IDH2_p172"
lesions[mutationData$GENE=='IDH2' & grepl("140", mutationData$AA_CHANGE)] <- "IDH2_p140"
lesions[mutationData$GENE=='FLT3' & grepl(paste(835:841, collapse="|"), mutationData$AA_CHANGE)] <- "FLT3_TKD"
lesions[mutationData$GENE=='FLT3' & grepl("ITD", mutationData$AA_CHANGE)] <- "FLT3_ITD"
lesions[lesions=="FLT3"] <- "FLT3_other"
# Add FLT3_ITD VAF; presumably not the most accurate, due to mapping problems for ITDs.
i <- lesions == "FLT3_ITD"
m <- match(mutationData$SAMPLE_NAME[i], dataFLT3_ITD$Sample)
mcf[i] <- as.numeric(as.character(dataFLT3_ITD$Read_count[m]))/dataFLT3_ITD$Coverage[m]
## Warning: NAs introduced by coercion
depth[i] <- dataFLT3_ITD$Coverage[m]
ix= lesions %in% colnames(precedence) & mutationData$Result %in% c("ONCOGENIC","POSSIBLE")
for(s in clinicalData$PDID){
l <- list()
for(i in which(mutationData$SAMPLE_NAME==s & ix))
for(j in which(mutationData$SAMPLE_NAME==s & ix)){
if(!is.na(cn[i]) & !is.na(cn[j]) & i!=j){
m <- round(matrix(c(
mcf[i]*depth[i],
depth[i]-mcf[i]*depth[i],
mcf[j]*depth[j],
depth[j]-mcf[j]*depth[j]),
ncol=2))
f <- try(fisher.test(m, alternative="greater")$p.value< 0.01 , silent=TRUE) ## Fisher test
if(class(f)!="try-error")
if(f & mcf[i] >= 1 - mcf[j]){ ## Pigeonhole
precedence[as.character(lesions[i]),as.character(lesions[j])] <- precedence[as.character(lesions[i]),as.character(lesions[j])] + 1
l <- c(l, list(c(as.character(lesions[i]),as.character(lesions[j]))))
genesClonal[s, as.character(lesions[i])] <- 2
genesClonal[s, as.character(lesions[j])] <- 3
}
}
}
plist[[s]] <- l
}
t <- table(sapply(plist, length)>0)
pie(t, labels=paste(t, c("clonal/NA","polyclonal")), col=set1[2:1])
makeDesign <- function(I) {
w <- which(lower.tri(I), arr.ind=TRUE)
x <- matrix(0, nrow(w), nrow(I))
for(i in 1:nrow(w)){
x[i,w[i,1]] <- 1
x[i,w[i,2]] <- -1
}
return(x)
}
btModel <- function(I){
y <- cbind(I[lower.tri(I)], t(I)[lower.tri(I)])
x <- makeDesign(I = I)
glm.fit(x=x[,-1],y=y, family=binomial())
}
nCasesGene <- table(factor(unlist(sapply(plist, function(x) unique(unlist(x)))), levels=colnames(precedence)))
w <- which(nCasesGene > 5)
fit <- btModel(precedence[w,w]+.01)
## Warning: non-integer counts in a binomial glm!
c <- c(0,coef(fit))
names(c) <- colnames(precedence)[w]
o <- rank(c)
v <- pmin(2,sqrt(c(0,diag(chol2inv(fit$qr$qr)))))
l <- names(c)
m <- paste("n=",nCasesGene[w], sep="")
plot(-c, o, xlab="Relative time", yaxt="n", pch=19, col="grey", ylab="", xlim=range(-c+3*c(-v,v)))
segments(-c-v, o,-c+v,o, col="grey")
text(-c-v ,o,l, font=3, pos=2)
text(-c+v ,o,m, font=1, pos=4)
Here we generate a panel overview of all genetic lesions and their impact on outcome, split by clonal and subclonal status.
par(mfrow=c(8,8), mar=c(1.5,2.5,1.5,0.5), mgp=c(2,0.5,0), bty="L", xpd=TRUE, las=1, tcl=-0.2, cex.axis=1.25)
for(g in colnames(genesClonal)){
p <- try(pchisq(survdiff(osYr ~ genesClonal[,g] == 3, subset=genesClonal[,g]>0)$chisq,1,lower.tail=FALSE))
plot(survfit(osYr ~ factor(genesClonal[,g], levels=0:3)), col=set1[c(9,c(4,2,1))], mark=NA, xlim=c(0,5))
mtext(side=3, paste0(g, ifelse(class(p)!="try-error",mg14::sig2star(p),"")), line=0, font=4)
}
plot.new(); par(xpd=NA)
legend("topleft", col=set1[c(9,c(2,4,1))], lty=1, c("wt","clonal","indeterminate","subclonal"), cex=1.5, bty="n")
plot.new(); par(xpd=NA)
legend("topleft", c(".","*","**","***", "P (0.05, 0.1]", "P (0.01, 0.05]", "P (0.001, 0.01]", "P < 0.001"), ncol=2, cex=1.5, bty="n", text.width= 0.1)
We use overall survival, measured from date of diagnosis, as the endpoint.
We implemented sparse random effects for the Cox proportional hazards model in the CoxHD R package, available at [http://github.com/mg14/CoxHD]. This implementation can handle both constant-covariate and time-dependent models. The latter is important to quantify the effects of allografts, which are typically administered well after diagnosis. The model is fit with CoxHD::CoxRFX().
Let the hazard be: \[ \lambda = \lambda_0(t) \exp(u^T Z)\]
Define \(h=u^T Z\) as the log hazard. \(\lambda_0(t)\) is the usual baseline hazard of a coxph model.
The random effects model used here is an example of a hierarchical model with an additional assumption about the distribution of the parameters \(u\). We assume that these follow a normal distribution. This additional assumption leads to a ridge-type regularisation of the log-likelihood.
Let there be \(p\) covariates and \(\{g\}\) be a partitioning of the \(p\) variables into \(|g|\) groups. For each group assume that the parameters \(u_j\) are iid Normally distributed in each group:
\[\forall j \in g: u_j \sim \operatorname{N}(\mu_g;\sigma^2_g) \qquad iid. \]
The shared means \(\mu_g\) are motivated by the observation that the effect of oncogenic lesions is, on average, deleterious.
We use the convention that variables without indexes refer to the set of variables. In particular \(u = \{u_j: j=1,...,p\}\), \(u_g=\{u_j: j\in g\}\).
The full logarithmic likelihood reads:
\[\begin{align} \ell(u,\sigma^2,\mu;Z) &= \ell_0(u;Z) - \sum_g \frac{\sum_{j\in g}(u_j-\mu_g)^2}{\sigma^2_g} \cr &= \ell_0(u;Z) + \ell_2(u,\mu,\sigma^2). \end{align}\]
The term \(\ell_0(u)\) is the likelihood of an unpenalised coxph model. The second term is a sum of ridge penalties resulting from the constraints imposed by the normal distribution of \(u\), which penalises large values of \(u_j-\mu_g\) with strength \(1/\sigma^2_g\).
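The correspondence between the normal assumption on \(u\) and a ridge penalty can be checked in a Gaussian analogue, replacing the Cox partial likelihood \(\ell_0\) by a Gaussian log-likelihood: the MAP estimate then has a closed ridge form at which the penalised gradient vanishes. This is only an illustrative sketch (in Python, to keep it self-contained), not part of the CoxHD implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 5
tau2, sigma2, mu = 1.0, 0.5, 0.2   # noise variance, prior variance and mean of u
Z = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# MAP under u_j ~ N(mu, sigma2): a ridge estimate shrunk towards mu
u_hat = np.linalg.solve(Z.T @ Z / tau2 + np.eye(p) / sigma2,
                        Z.T @ y / tau2 + mu / sigma2)

# gradient of the penalised log-likelihood, which vanishes at u_hat
grad = Z.T @ (y - Z @ u_hat) / tau2 - (u_hat - mu) / sigma2
```

The shrinkage towards \(\mu\) relative to the unpenalised estimate follows directly from the added \(1/\sigma^2\) term in the Hessian.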
Note that the likelihood can be reparametrised by introducing the auxiliary variables \(z_g = \sum_{j\in g} Z_{.j}\) and the centred effects \(u_j = u_j - \mu_g\):
\[\ell(u,\sigma^2,\mu;Z) = \ell_0(u,\mu;Z,z) + \ell_2(u,\sigma^2) =: \ell(u, \mu,\sigma^2;Z)\]
All of the following steps are implemented in the CoxHD R package, available at [http://www.github.com/mg14/CoxHD]. It can be installed using devtools::install_github("mg14/CoxHD/CoxHD"). The implementation makes heavy use of the survival package (T. Therneau 2014). It is about 100x faster than the coxme R package for mixed effects Cox models by T. Therneau (2012), as it exploits the fact that the \(u\) are iid.
We use an EM algorithm as suggested by Perperoglou (2014) for Cox models, based on the work by Schall (1991). The algorithm iteratively estimates the following quantities:
1. Given \(\hat\sigma^2\), jointly estimate
1.1. the shared means \(\hat\mu_g\) as the effect of the auxiliary variables \(z_g\).
1.2. the centred variables \(\hat u\) as a ridge estimate, \[\hat\mu, \hat u = \arg \max \ell(u, \mu,\hat\sigma^2;Z)\]
2. Given \(\hat\mu\) and \(\hat u\), the variances are estimated as: \[\hat\sigma_g^2 = \sum_{j\in g}\hat u_j^2/df_g, \qquad df_g = \operatorname{tr} [\mathcal{I}_{gg} H_{gg}^{-1}],\] where \(H\) is the Hessian matrix of the penalised model and \(\mathcal{I}\) the observed Fisher information of the unpenalised model (each evaluated for the variables of group \(g\)).
3. Iterate until convergence of the parameters and the penalised likelihood.
The final parameters are given by uncentering \(\hat{u}_j = \hat u_j + \hat \mu_g\).
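To make the iteration concrete, the following sketch runs the Schall-type update in the Gaussian analogue with a single group and \(\mu_g=0\) (Python, for a self-contained illustration; the actual Cox implementation is CoxHD::CoxRFX()):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, tau2 = 2000, 100, 1.0                 # observations, covariates, noise variance
Z = rng.standard_normal((n, p))
u_true = rng.normal(0.0, np.sqrt(0.5), p)   # true sigma^2 = 0.5
y = Z @ u_true + rng.normal(0.0, np.sqrt(tau2), n)

ZtZ, Zty = Z.T @ Z, Z.T @ y
I_obs = ZtZ / tau2                          # observed Fisher information (unpenalised)
sigma2 = 1.0                                # initial guess
for _ in range(100):
    # step 1: ridge (MAP) estimate of u given the current sigma2
    H = I_obs + np.eye(p) / sigma2          # penalised Hessian
    u_hat = np.linalg.solve(H, Zty / tau2)
    # step 2: variance update with effective degrees of freedom df = tr(I H^-1)
    df = np.trace(I_obs @ np.linalg.inv(H))
    sigma2_new = np.sum(u_hat**2) / df
    # step 3: iterate until convergence
    if abs(sigma2_new - sigma2) < 1e-8:
        sigma2 = sigma2_new
        break
    sigma2 = sigma2_new
```

The estimated variance recovers the simulated value of 0.5 up to sampling error, and the MAP estimates are shrunk relative to the unpenalised least-squares fit.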
Note: These estimates \(\hat{u}\) are maximum a posteriori (MAP) estimates under a Bayesian interpretation, with \(\hat\sigma^2\) and \(\hat\mu\) being empirical Bayes estimates.
To assess the sampling distributions of our estimates, e.g. their variances, we use the following semi-parametric bootstrap approach:
For i = 1, ..., 100, simulate \(n\) semi-parametric survival times Y (see [#survival]).
This allows assessing the distribution of all estimates in a semi-parametric way.
Two estimates exist for the covariance matrices of the parameters \(\hat u\) and \(\mu\) (T. M. Therneau, Grambsch, and Pankratz 2003):
\(\hat V_1 = H^{-1}\)
\(\hat V_2 = H^{-1} \mathcal{I} H^{-1}\), where \(H\) is the Hessian matrix of the penalised model and \(\mathcal{I}\) the observed Fisher information of the unpenalised model. Semi-parametric bootstrap simulations show that \(V_2\) is more accurate in our context.
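In the ridge-penalised linear model the exact sampling covariance of the penalised estimator is available and equals the sandwich form, which makes the difference between \(\hat V_1\) and \(\hat V_2\) easy to see by simulation (an illustrative Python sketch, not the Cox case itself):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam, tau2, reps = 200, 4, 50.0, 1.0, 5000
Z = rng.standard_normal((n, p))
u = rng.standard_normal(p)

I_obs = Z.T @ Z / tau2                      # observed Fisher information
A = np.linalg.inv(I_obs + lam * np.eye(p))  # H^{-1} with penalty strength lam
V1 = A                                      # naive estimate H^{-1}
V2 = A @ I_obs @ A                          # sandwich H^{-1} I H^{-1}

# empirical sampling covariance of the penalised estimator over many replicates
noise = rng.standard_normal((reps, n)) * np.sqrt(tau2)
Y = (Z @ u)[None, :] + noise
U = (A @ (Z.T @ Y.T) / tau2).T              # one ridge estimate per replicate
V_emp = np.cov(U, rowvar=False)
```

The empirical covariance matches the sandwich estimate closely, while \(H^{-1}\) overstates the variance of a strongly penalised estimator.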
The estimates \(\hat V_\cdot\) have dimension \((p+|g|)\times(p+|g|)\).
The uncentered variance estimates of the parameter \(u_j = u_j + \mu_g\) are given by
\[\hat V_\cdot[u_j] = \hat V_\cdot[u_j,u_j] + \hat V_\cdot[\mu_g,\mu_g] + 2 \hat V_\cdot[u_j,\mu_g],\]
thus accounting for the correlation of \(u_j\) and \(\mu_g\).
Using the variance estimate \(\hat V_2\) allows computing a Wald-type test with one degree of freedom.
\[ \begin{align} z &= \hat{u}^2 / \hat V_2[u] \\\\ z &\sim \chi^2_1 \end{align} \]
This is implemented as CoxHD::WaldTest().
P-values of each test are corrected for multiple testing. Due to the dependence imposed by the shared distribution we use the Benjamini-Yekutieli method for controlling the false discovery rate (Q < FDR), implemented as p.adjust(x, method="BY").
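For reference, R's p.adjust(x, method="BY") computes the following step-up adjustment; the sketch below (Python, for illustration only, as the analyses use p.adjust directly) mirrors it:

```python
import numpy as np

def p_adjust_by(p):
    # Benjamini-Yekutieli step-up adjustment, mirroring R's p.adjust(method="BY")
    p = np.asarray(p, dtype=float)
    m = len(p)
    c_m = np.sum(1.0 / np.arange(1, m + 1))   # harmonic correction for dependence
    order = np.argsort(p)[::-1]               # largest p-value first
    ranks = np.arange(m, 0, -1)               # m, m-1, ..., 1
    q = np.minimum.accumulate(c_m * m / ranks * p[order])
    out = np.empty(m)
    out[order] = np.minimum(q, 1.0)
    return out
```

Relative to Benjamini-Hochberg, the only difference is the harmonic factor \(c(m)=\sum_{i=1}^m 1/i\), which makes the procedure valid under arbitrary dependence.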
Note: There exists a lively debate about how, and whether at all, random effects should be tested; see [http://glmm.wikidot.com/faq] or [https://stat.ethz.ch/pipermail/r-sig-mixed-models/2008q2/000743.html]. Here we use an approach outlined by Gray (1992), T. M. Therneau, Grambsch, and Pankratz (2003) and Wood (2013). However, it is important to check that the variances are correctly specified, using a parametric bootstrap approach.
In an additive model the linear predictor of the log hazard \(h\) is given by:
\[h = u^T Z = \sum_g \sum_{j\in g} u_j^T Z_{.j} = \sum_g h_g\]
where the set \(\{g\}\) is a partitioning of all covariates. We define \(h_g\) as the partial log hazard contributed by group \(g\).
The variance of the logarithmic hazard is given by:
\[Var[h] = \sum_{g,h} Cov(h_g,h_h)\]
Taking just the diagonal elements of \(Cov(h_g,h_h)\) guarantees positive values, but these do not necessarily sum to the total variance. Using \(V_g = \sum_h Cov(h_h,h_g)\) yields additive variance components, albeit at the cost of possibly negative values in cases with strong collinearity of the components.
Variance components are implemented as CoxHD::VarianceComponents().
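Because the covariance is bilinear, the components \(V_g\) add up to the total variance by construction; a quick numerical check with simulated, correlated partial log hazards (Python sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n, G = 1000, 3
# simulate correlated partial log hazards h_g (one column per group)
H = rng.standard_normal((n, G)) @ rng.standard_normal((G, G)).T
C = np.cov(H, rowvar=False)             # Cov(h_g, h_h)
V = C.sum(axis=1)                       # V_g = sum_h Cov(h_h, h_g)
total = np.var(H.sum(axis=1), ddof=1)   # Var[h], h = sum_g h_g
```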
Note: Unlike in a classical mixed model, \(V\) is not computed by marginalising the random effects, but from the MAP estimates. This can be seen as a first-order approximation.
The standard deviation \(\sqrt{Var[h]}\) determines the average difference between any two patients in logarithmic hazard.
For a normally distributed hazard, the variance \(\sigma^2_h\) of the log hazard is related to the concordance metric (Gönen and Heller 2005)
\[C = \int \frac{1}{1+\exp(-|x|)}f(x;0,2\sigma^2_h) dx\]
where \(f(x;\mu,\sigma^2)\) is the density of the normal distribution and \(x\) is the difference in log hazard between two randomly chosen patients. There exists no analytical solution to the above equation, but it may be computed numerically. For a variance of 1, the concordance is 72.5%.
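The integral is straightforward to evaluate by quadrature. Taking \(x\) as the difference in log hazard between two randomly chosen patients, its density has variance \(2\sigma^2_h\); with this convention, \(\sigma^2_h = 1\) reproduces the quoted concordance of 72.5% (Python sketch for illustration):

```python
import math

def concordance(var_h, step=1e-3, lim=12.0):
    """Numerically integrate C = E[1 / (1 + exp(-|x|))] with x ~ N(0, 2*var_h)."""
    s2 = 2.0 * var_h
    total = 0.0
    x = -lim
    while x < lim:
        dens = math.exp(-x * x / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)
        total += dens / (1.0 + math.exp(-abs(x))) * step
        x += step
    return total
```

Concordance increases monotonically with the variance of the log hazard, approaching 1 in the limit.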
Note: For a Cox proportional hazards model, even perfect knowledge of the hazard does not guarantee perfect concordance (i.e. C=1), due to the sampling of the survival times. The limit \(Var[h] \rightarrow \infty\), in which the hazard ratio between any two patients is infinite, yields deterministic behaviour with \(C=1\).
The prediction error of the log hazard for patient \(i\) is given by
\[\hat V[h_i] = V[\hat{u}^T Z_{i\cdot}] = Z_{i\cdot}^T \hat V[u] Z_{i\cdot}\]
where \(V[\hat{u}]\) is the covariance matrix of the parameters defined in Analytical confidence intervals.
Note: In a linear model, the lhs corresponds to the residual \(r_i\) of observation \(i\) and the identity \(\hat V = (Z^T Z)^{-1} \times RSS/n\) holds. In our case \(V\) is derived from the Fisher information, but it can be intuitive to think about the average prediction error \(\sum_i \hat V[h_i]/n\) as a pseudo residual variance.
To predict the log-hazard in the presence of missing variables, we can use the following imputation, leveraging the covariance in the training set:
Suppose that \(Z = (Z_o,Z_m)\), where \(Z_o\) are observed and \(Z_m\) missing parts of the data set. Suppose we know the means \(\mu\) and covariance \(\Sigma\). Then
\[E[Z_m] = \mu_m + \Sigma_{m,o} \Sigma_{o,o}^{-1} (Z_o - \mu_o)\]
\[V[Z_m] = \Sigma_{mm} - \Sigma_{mo} \Sigma_{oo}^{-1} \Sigma_{om} \]
The uncertainty in \(Z_m\) adds another term to the prediction error:
\[\hat V[h_i] = Z_{io}^T \hat V[u]_{oo} Z_{io} + u_m^T V[Z_{im}] u_m\]
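A minimal numerical sketch of the conditional-normal formulas above (Python; impute_normal is a hypothetical helper for illustration, not part of CoxHD):

```python
import numpy as np

def impute_normal(z_o, mu, Sigma, obs, mis):
    # conditional mean and covariance of Z_m given Z_o = z_o
    B = Sigma[np.ix_(mis, obs)] @ np.linalg.inv(Sigma[np.ix_(obs, obs)])
    E = mu[mis] + B @ (z_o - mu[obs])
    V = Sigma[np.ix_(mis, mis)] - B @ Sigma[np.ix_(obs, mis)]
    return E, V

# bivariate example with correlation 0.6: observing Z_o = 1 pulls the
# imputed mean to 0.6 and reduces the variance from 1 to 1 - 0.6^2 = 0.64
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
E, V = impute_normal(np.array([1.0]), mu, Sigma, obs=[0], mis=[1])
```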
coxph() with AIC- or BIC-based forward and backward selection, beginning from the empty model. The implementation in the survival R package (T. Therneau 2014) handles both constant-covariate and time-dependent models.
Complementary pairs stability selection (CPSS) is an extension of the stability selection protocol, which combines subsampling and LASSO-regularised regression to obtain a robust subset of predictor variables (Meinshausen and Bühlmann 2010). Using complementary pairs of subsamples, Shah and Samworth (2013) derived a tighter bound for error control.
We have recently used CPSS to analyse the association of genomic predictors and outcome in myelodysplastic syndromes (Papaemmanuil et al. 2013). To this end, we have implemented CPSS in the CoxHD R package. Our implementation fits the CPSS model using the glmnet algorithm (J. Friedman, Hastie, and Tibshirani 2010; Simon et al. 2011).
The algorithm CoxHD::CoxCPSS() uses the following parameters:
A coxph() model with the selected variables is used for predictions. Note that the glmnet algorithm cannot handle time-dependent covariates.
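The resampling scheme itself can be sketched as follows (Python toy example; a simple correlation-screening selector stands in for the glmnet lasso used by CoxHD::CoxCPSS, so this illustrates only the complementary pairs subsampling, not the actual model fitting):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, B, k = 200, 50, 50, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 1.0                               # three true predictors
y = X @ beta + rng.standard_normal(n)

def select(idx):
    # stand-in base selector: top-k absolute inner products with y
    score = np.abs(X[idx].T @ y[idx])
    return np.argsort(score)[-k:]

counts = np.zeros(p)
for _ in range(B):
    perm = rng.permutation(n)
    for half in (perm[: n // 2], perm[n // 2:]):   # complementary pair
        counts[select(half)] += 1
freq = counts / (2 * B)                      # per-variable selection frequency
stable = np.flatnonzero(freq >= 0.9)         # stably selected set
```

Variables that carry signal are selected in nearly every subsample, while noise variables rarely exceed the frequency threshold.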
Random survival forests are an intrinsically non-linear alternative to Cox proportional hazards based regression (Ishwaran et al. 2008). The idea is to fit an ensemble of regression trees based on subsampling of patients and/or covariates. The resulting predictions are averaged across the forest of regression trees. We used version 1.6 of the randomForestSRC package and default options for randomForestSRC::rfsrc(). Note that the model can only handle constant covariates.
Construct data.frame for OS, replicating patients (rows) before and after allograft.
dataFrameOsTD <- dataFrame[tplSplitOs,]
dataFrameOsTD[which(tplIndexOs), grep("TPL", colnames(dataFrameOsTD), value=TRUE)] <- 0 ## Set pre-tpl variables to zero
Define some indexes relating to subsets of variables used by the random effects model.
mainGroups <- grep("[A-Z][a-z]+[A-Z]",levels(groups), invert=TRUE, value=TRUE)
mainGroups
## [1] "CNA" "Clinical" "Demographics" "Fusions" "Genetics" "Nuisance" "Treatment"
mainIdx <- groups %in% mainGroups
osIdx <- !grepl("TPL", colnames(dataFrame)) ## Exclude TPL from OS analyses..
whichRFXOs <- which((colSums(dataFrame)>=8 | mainIdx) & osIdx) # ie, > 0.5%
mainIdxOs <- mainIdx & osIdx
osTDIdx <- !grepl("TPL_efs", colnames(dataFrame))
whichRFXOsTD <- which((colSums(dataFrame)>=8 | mainIdx) & osTDIdx) # ie, > 0.5%
mainIdxOsTD <- mainIdx & osTDIdx
whichRFXOsGG <- which((colSums(dataFrame)>=8 | mainIdxOs) & osIdx & groups %in% c(mainGroups,"GeneGene")) # ie, > 0.5%
Compute the number of oncogenic lesions, excluding complex karyotype.
NONC <- rowSums(cbind(dataList$Cytogenetics[names(dataList$Cytogenetics)!="complex"], dataList$Genetics), na.rm=TRUE)
NONCs <- factor(ceiling(pmin(NONC,7)/2), labels=c("0","1-2","3-4","5-6","7+"))
c <- set1[c(3,2,4,1,5)]
f <- survfit(osYr ~ NONCs)
s <- summary(f)
plot(f, col=c, xlim=c(0,10),xlab="Years", ylab="Survival",mark="|", cex=.5)
legend('topright', bty='n', col=c, legend=paste0(levels(NONCs)," (n=",table(NONCs),")"), lty=1)
Fit a spline through continuous covariates
set.seed(42)
trainIdx <- sample(c(TRUE,FALSE), nrow(dataFrame), replace=TRUE, prob=c(0.66,0.34))
trainIdxOsTD <- trainIdx[tplSplitOs]
par(mfrow=c(3,3))
clinicalSpline <- as.data.frame(sapply(dataFrame[groups %in% c("Clinical","Demographics")], function(x){
if(all(x[1:5] %in% 0:10)) return(x)
y <- log(x+min(x)+1e-3)
fit <- coxph(os ~ pspline(y, df=3), subset=trainIdx)
predict(fit, newdata=data.frame(y=y))
}))
for(n in names(clinicalSpline)) if(!all(dataFrame[1:5,n] %in% 0:10))
plot(dataFrame[,n], clinicalSpline[,n], log='x', xlab=paste(n, '[observed]'), ylab = paste(n, '[spline]'))
## Warning in xy.coords(x, y, xlabel, ylabel, log): 5 x values <= 0 omitted from logarithmic plot
## Warning in xy.coords(x, y, xlabel, ylabel, log): 110 x values <= 0 omitted from logarithmic plot
summary(coxph(os ~ ., data=clinicalSpline, subset=!trainIdx))$concordance
## concordance.concordant se.std(c-d)
## 0.63780865 0.01772148
summary(coxph(os ~ ., data=dataFrame[groups %in% c("Clinical","Demographics")]), subset=!trainIdx)$concordance
## concordance.concordant se.std(c-d)
## 0.64297237 0.01003199
Thus there is no measurable improvement over (scaled) linear terms.
Here we fit the random effects model using our implementation in the CoxHD package. First for main effects only.
coxRFXFitOsTDMain <- CoxRFX(dataFrameOsTD[,mainIdxOsTD], osTD, groups[mainIdxOsTD])
Now including gene:gene interaction terms (min. recurrence = 8)
whichRFXOsTDGG <- which((colSums(dataFrame)>=8 | mainIdxOsTD) & osTDIdx & groups %in% c(mainGroups,"GeneGene")) # ie, > 0.5%
coxRFXFitOsTDGGc <- CoxRFX(dataFrameOsTD[,whichRFXOsTDGG], osTD, groups[whichRFXOsTDGG], which.mu=mainGroups) ## allow only the main groups to have mean different from zero..
Compute Harrell's concordance index
survConcordance(osTD~coxRFXFitOsTDGGc$linear.predictors)
## Call:
## survConcordance(formula = osTD ~ coxRFXFitOsTDGGc$linear.predictors)
##
## n= 1880
## Concordance= 0.7479676 se= 0.01003199
## concordant discordant tied.risk tied.time std(c-d)
## 710819.00 239515.00 0.00 432.00 19067.49
Here we compute the variance components.
colGroups <- c(brewer.pal(12, "Paired")[c(10)],brewer.pal(12, "Paired")[c(6,4,3,5,12,9,1,2,7)],"#999999", brewer.pal(12, "Paired")[c(8)])
colGroups <- colGroups[c(2:6,1,7:14)]
names(colGroups) <- levels(groups)[order(toupper(levels(groups)))]
PlotVarianceComponents(coxRFXFitOsTDGGc, col=colGroups)
title("Risk contributions OS (time-dep)")
Estimate confidence intervals by parametric bootstrap and compare with the Wald test. Note that the usual sampling with replacement yields inconsistencies for the interaction terms due to the overdispersed correlations. The theoretical description of the survival time simulation is given in the section on survival.
set.seed(42)
risk <- as.matrix(dataFrame[whichRFXOsTDGG]) %*% coxRFXFitOsTDGGc$coefficients
risk <- risk - mean(risk)
parBoot <- mclapply(1:100, function(i) {
s <- SimSurvNonp(risk, os)
c <- try(CoxRFX(dataFrame[whichRFXOsTDGG], s, groups=groups[whichRFXOsTDGG], sigma0=0.1, nu=0))
if(class(c)=="try-error")
return(s)
c$Z <- NULL # drop Z to save memory
return(c)
}, mc.cores=10)
Distributions of mean, sigma and df
boxplot(t(sapply(parBoot, `[[`, "sigma2")), border=colGroups[names(parBoot[[1]]$sigma2)], lty=1, pch=16, staplewex=0, ylab="sigma2", las=2, log="y", ylim=c(1e-3,1))
abline(h=0, lty=3)
points(coxRFXFitOsTDGGc$sigma2, pch=19)
boxplot(t(sapply(parBoot, `[[`, "mu")), border=colGroups[names(parBoot[[1]]$mu)], lty=1, pch=16, staplewex=0, ylab="mu", las=2)
abline(h=0, lty=3)
points(coxRFXFitOsTDGGc$mu, pch=19)
boxplot(t(sapply(parBoot, `[[`, "df")), border=colGroups[names(parBoot[[1]]$mu)], lty=1, pch=16, staplewex=0, ylab="df", las=2)
abline(h=0, lty=3)
points(coxRFXFitOsTDGGc$df, pch=19)
Coefficients
v <- apply(sapply(parBoot, `[[`, "coefficients"), 1, var, na.rm=TRUE)
w <- diag(coxRFXFitOsTDGGc$var) ## H^{-1}
w2 <- diag(coxRFXFitOsTDGGc$var2) ## H^{-1} I H^{-1}
c <- coef(coxRFXFitOsTDGGc)
plot(c^2/v, c^2/w, log="xy", xlab="Chi2 (bootstrap)", ylab="Chi2 (analyt.)", cex=.66)
par(xpd=NA)
points(c^2/v, c^2/w2, pch=16, cex=.7)
arrows(c^2/v, c^2/w, c^2/v,c^2/w2, length=0.05)
abline(0,1)
abline(h=qchisq(c(0.95,0.99,0.999), 1, lower.tail=TRUE), lty=c(1,2,3))
The plot indicates good agreement with the variance estimate var2, see section 2.1.4. Knowing the distribution of the variance allows us to compute a Wald test of the coefficients.
#### Supplementary Table S1
Table with significance
library(DT)
library(htmlwidgets)
pBoot <- pchisq(c^2/v,1, lower.tail=FALSE)
pVar2 <- pchisq(c^2/w2,1, lower.tail=FALSE)
pVar <- pchisq(c^2/w,1, lower.tail=FALSE)
waldOut <- data.frame(group = groups[whichRFXOsTDGG],
`beta (log-hazard)`= c,
`hazard exp(beta)` = exp(c),
n = ifelse(groups[whichRFXOsTDGG] %in% c("CNA","Fusions","Genetics","GeneGene"), colSums(dataRaw[sub("_10*$","",names(whichRFXOsTDGG))], na.rm=TRUE), NA),
sd = sqrt(w2),
`sd (bootstrap)` = sqrt(v),
`sd (var)`= sqrt(w),
`P-value`=pVar2,
`Q (Benjamini-Yekutieli)` = p.adjust(pVar2, "BY"),
`Q (Benjamini-Hochberg)` = p.adjust(pVar2, "BH"),
check.names=FALSE
)
datatable(as.data.frame(lapply(waldOut, function(x) if(class(x)=="numeric") round(x,4) else x), check.names=FALSE, row.names=row.names(waldOut)))
library(xlsx)
wb <- createWorkbook("xlsx")
sheet <- createSheet(wb, sheetName="Overall survival")
addDataFrame(waldOut,
sheet,
colnamesStyle = CellStyle(wb) + Font(wb, isBold=TRUE) + Border(),
rownamesStyle = CellStyle(wb) + Font(wb, isBold=TRUE)
)
Volcano plot
par(mar=c(3,3,1,1)+.1, bty="n", mgp=c(2,.5,0))
i <- coxRFXFitOsTDGGc$groups %in% c("Genetics", "CNA","Fusions","GeneGene","Treatment")#apply(coxRFXFitOsTDGGc$Z,2,min) == 0 & apply(coxRFXFitOsTDGGc$Z,2,max) == 1
p <- pVar2 ## pvalues coxRFX
plot(c, 1/p, log='y', col=paste(colGroups[as.character(coxRFXFitOsTDGGc$groups)],"BB", sep=""), pch=ifelse(i,16,16), ylab="P-value",xlab="log hazard", cex=ifelse(i, sqrt(colMeans(coxRFXFitOsTDGGc$Z[!rev(duplicated(rev(tplSplitOs))),])*50),1), xlim=range(c*1.2))
## Warning in sqrt(colMeans(coxRFXFitOsTDGGc$Z[!rev(duplicated(rev(tplSplitOs))), : NaNs produced
#abline(h=qchisq(c(0.95,0.99,0.999), 1, lower.tail=TRUE), lty=c(1,2,3))
w <- which(p.adjust(p,"BY") < 0.1)
points(c[w], 1/p[w], pch=1, cex=ifelse(i[w], sqrt(colMeans(coxRFXFitOsTDGGc$Z[!rev(duplicated(rev(tplSplitOs))),w])*50),1))
w <- which(p.adjust(p,"bonf") < 0.05)
par(xpd=NA)
text(c[w], 1/p[w], names(c[w]), pos=3)
u <- par("usr")
f <- c(0.01,0.05,0.1,0.2,0.5)
s <- sqrt(f*50)
legend("topright",legend=f, pch=16, pt.cex=s, bty='n', col=paste("#88888888"))
par(xpd=FALSE)
abline(h=1/0.05, lty=2)
abline(h=1/max(p[which(p.adjust(p,"BY") < 0.1)]), lty=3)
P-values and random model
set.seed(42)
Z <- apply(coxRFXFitOsTDGGc$Z, 2,sample)[1:nrow(dataFrame),] ## random covariates
coxRFXFitOsRain <- CoxRFX(Z, os, groups=coxRFXFitOsTDGGc$groups, nu=1) ## model
w2 <- diag(coxRFXFitOsRain$var2)
c <- coef(coxRFXFitOsRain)
p2 <- pchisq(c^2/w2, 1, lower.tail=FALSE) ## P-values of the randomised model
plot(seq(0,1,l=length(p2)+1)[-1],sort(p2), xlab="P-value (expected)", ylab="P-value (observed)", pch=16, col="grey")
abline(0,1)
points(seq(0,1,l=length(p)+1)[-1],sort(p), pch=16)
legend("topleft",bty="n", c("observed","randomised"), pch=16, col=c("black","grey"))
Distribution of the variance components
v <- t(sapply(parBoot, function(x) {t <- try(VarianceComponents(x, newZ=dataFrame[whichRFXOsTDGG])); if(class(t)=="try-error") rep(NA, nlevels(x$groups)+1) else t}))
boxplot(v, border=colGroups[colnames(v)], lty=1, pch=16, staplewex=0, ylab="variance comp.", las=2)
abline(h=0, lty=3)
points(VarianceComponents(coxRFXFitOsTDGGc), pch=19)
rm(parBoot)
Plot of log hazard vs outcome
par(mar=c(3,3,3,1), mgp=c(2,.5,0))
t <- os
s <- survfit(os~1)
q <- quantile(t[,1], seq(0,1,.1))# q <- splinefun( s$surv, s$time,"monoH.FC")(seq(1,min(s$surv),l=10))
c <- cut(t[,1], q, na.rm=TRUE)
h <- coxRFXFitOsTDGGc$linear.predictors[rev(!duplicated(rev(tplSplitOs)))][order(tplSplitOs[rev(!duplicated(rev(tplSplitOs)))])]
o <- order(h)
plot(h[o], col= (brewer.pal(10,'RdBu'))[c[o]], type='h', xaxt="n", xlab='Patient', las=2, ylab="log hazard")
u <- par("usr")
q <- pmin(q,365*12)
image(x=q/max(q)*500, y=c(u[4]-(u[4]-u[3])/20, u[4]), matrix(1:10), col= (brewer.pal(10,'RdBu')), add=TRUE)
#axis(side=3, at=seq(1,500,l=11), labels=seq(0,1,.1))
axis(side=3, at=pretty(q/365)/max(q)*365*500, labels=pretty(q/365))
lines(ksmooth(seq_along(o),t[o,2]==0, bandwidth=50))
Risk constellation plots using the stars() function
set.seed(42)
library(HilbertVis)
nStars <- 32
s <- sample(nrow(dataFrame),nStars^2) #1:(nStars^2)
l <- "coxRFXFitOsTDGGc"
t <- os#get(l)$surv
p <- PartialRisk(get(l), newZ=dataFrame[, colnames(get(l)$Z)])
p <- p[,colnames(p)!="Nuisance"]
locations <- 1.5*hilbertCurve(log2(nStars)) #2*expand.grid(1:nStars,1:nStars)
h <- hclust(dist(p[s,]))
x <- p - rep(colMeans(p), each=nrow(p))
x <- x/(2*sd(x)) + 1
c <- cut(t[s,1][h$order], quantile(t[,1], seq(0,1,0.1), na.rm=TRUE))
if(l=="coxRFXFitOsTDGGc")
x <- x[,c("Demographics","Treatment","Fusions","CNA","Genetics","GeneGene","Clinical")]
mg14:::stars(x[s,][h$order,]/2, scale=FALSE, locations=locations, key.loc=c(0,-3), col.lines=ifelse(t[s,2][h$order],1,NA), col.stars = (brewer.pal(11,'RdBu'))[c], density=ifelse(t[s,2][h$order],NA,NA))
symbols(locations[,1], locations[,2], circles=rep(.5,(nStars^2)), inches=FALSE, fg="grey", add=TRUE, lty=1)
title(main=l)
Select 5 patients according to their genotype/outcome (the first match of each pattern)
patients <- c(
which(dataFrame$`TP53`==1 & dataFrame$complex==1 & os[,1] < 300 & os[,2]==1)[1],
which(dataFrame$`NPM1:FLT3_ITD:DNMT3A`==1 & os[,1] < 300 & os[,2]==1)[1],
which(dataFrame$SFRS2==1 & clinicalData$WHOcat=='no' & os[,2]==1)[1],
which(dataFrame$NPM1==1 & dataFrame$FLT3_ITD==0 & dataFrame$DNMT3A==0 & os[,1] > 2000)[1],
which(dataFrame$t_15_17==1 & os[,1] > 2000)[1]
)
genotype <- apply(dataFrame[groups %in% c("Fusions","CNA","Genetics")]==1, 1,function(x) paste(names(which(x)), collapse=";"))
t <- os
p <- PartialRisk(coxRFXFitOsTDGGc, newZ=dataFrame[, whichRFXOsTDGG])
p <- p[,colnames(p)!="Nuisance"]
locations <- 1.5*hilbertCurve(log2(nStars)) #2*expand.grid(1:nStars,1:nStars)
x <- p - rep(colMeans(p), each=nrow(p))
x <- x/(2*sd(x)) + 1
c <- cut(t[patients,1], quantile(t[,1], seq(0,1,0.1), na.rm=TRUE))
x <- x[patients,c("Demographics","Treatment","Fusions","CNA","Genetics","GeneGene","Clinical")]
locations <- expand.grid(seq_along(patients)* 1.5, 1)
mg14:::stars(x/2, scale=FALSE, locations=locations, key.loc=NA, col.lines=ifelse(t[patients,2],1,NA), col.stars = (brewer.pal(11,'RdBu'))[c])
symbols(locations[,1], locations[,2], circles=rep(.5,length(patients)), inches=FALSE, fg="grey", add=TRUE, lty=1)
text(locations[,1], locations[,2]-1,labels=clinicalData$PDID[patients], pos=1)
l <- apply(dataFrame[patients,c("gender","AOD_10","TPL_os","wbc_100")], 1,paste, collapse=";")
par(xpd=NA)
text(locations[,1], locations[,2]+1,labels=paste(gsub(";","\n",genotype[patients]),l, paste(round(os[patients,1],2), osYr[patients,2]), sep="\n"), pos=3)
We use the following nomenclature: \(f(T=t)=f(t)\) denotes a probability density and \(F(T=t)=P(T<t)=F(t)\) the corresponding cumulative distribution function. \(S(T=t)=1-F(t)\) is the survivor function, the name being motivated by the situation that \(t\) is a death time. Where it is clear to which variable a (cumulative) density refers, we drop the stochastic variable and simply use its value as the argument, \(f(t)=f(T=t)\). We use the convention that lower case variables \(t\) denote the values of the corresponding upper case stochastic variables, \(T=t\), \(U=u\) and so on. For a categorical stochastic process \(X_t\), \(t\in \mathbb{R}^+\), we use the symbol \(P(X_t)\) to denote the probability distribution at time \(t\). The symbol \(Z\) denotes the covariates.
We use a hierarchical multistage model to quantify the rates at which a patient progresses from one disease/treatment stage to another (Figure 1A). After learning the marginal time-dependent transition probabilities for each event, we can combine these into a time-dependent joint probability.
In particular, we model the following transition times:
CR and CIR are midpoints, allowing for further events; NCD, NRD and PRD are endpoints. Due to the hierarchical nature of the model, only one endpoint can ever occur and midpoints are transient.
The probability to be in a given state is then given by the combination of event times.
To be alive in CR at time t, for example, requires that CR occurred before t, CR was achieved before NCD, and neither relapse nor NRD has occurred yet. Overall, a patient can only be in one of the following six states at time t, each corresponding to a particular ordering of event times:
| Stage | Abbreviation | Ordering of times | Symbol |
|---|---|---|---|
| Alive in induction | AI | \(t < T_{CR}, T_{NCD}\) | \(\mathcal{I}_{AI}(t)\) |
| Death without complete remission | NCM | \(T_{NCD} < t\); \(T_{NCD} < T_{CR}\) | \(\mathcal{I}_{NCM}(t)\) |
| Alive in complete remission | ACR | \(T_{CR} < T_{NCD}\); \(t < T_R, T_{NRD}\) | \(\mathcal{I}_{ACR}(t)\) |
| Death without relapse | NRM | \(T_{CR} < T_{NCD}\); \(T_{NRD} < T_R\); \(T_{NRD} < t\) | \(\mathcal{I}_{NRM}(t)\) |
| Alive after relapse | AR | \(T_{CR} < T_{NCD}\); \(T_R < T_{NRD}\); \(T_R < t < T_{PRD}\) | \(\mathcal{I}_{AAR}(t)\) |
| Death after relapse | PRM | \(T_{CR} < T_{NCD}\); \(T_R < T_{NRD}\) ; \(T_{PRD} < t\) | \(\mathcal{I}_{PRM}(t)\) |
This defines a stochastic process \(X_t\) on the given set of six states, \(X_t \in \{AI, ACR, AR, NCM, NRM, PRM\}\). Initially, all patients will be alive in induction, \(X_0=AI\).
The hierarchical nature of the model implies that the joint probability of event time factorises
\[ f(T_{CR},T_{NCD}, T_{R}, T_{NRD}, T_{PRD}) = f(T_{CR}) \times f(T_{NCD}) \times f(T_R \mid T_{CR})\times f(T_{NRD}\mid T_{CR})\times f(T_{PRD} \mid T_R). \label{eq:joint-density}\]
The above factorisation lays out a strategy in which each of the five factors may be estimated separately. The probability of each state \(P(X_t)\), defined in section states, is then computed by integrating the joint density \(f\), Eq.\(\eqref{eq:joint-density}\), over the simplexes \(\mathcal{I}_{\cdot}(t)\) defining a particular ordering of transitions detailed in table 2: \[ P(X_t = x) = \iiiint\!\!\!\!\!\int_{\mathcal{I}_x(t)} f(t_{CR},t_{NCD}, t_{R}, t_{NRD}, t_{PRD})\ dt_{CR}\ dt_{NCD}\ dt_{R}\ dt_{NRD}\ dt_{PRD}. \label{eq:mult-prob}\]
The integral can be successively evaluated as described below.
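As a sanity check of this representation, the state probabilities can be approximated by Monte Carlo simulation of the factorised density. The sketch below uses made-up exponential transition rates (purely illustrative, not fitted values) and tabulates the six states at one year:

```r
## Monte Carlo illustration of Eq. (mult-prob); all rates are hypothetical
set.seed(1)
n <- 1e5
tCR  <- rexp(n, 1/50)          # time to complete remission
tNCD <- rexp(n, 1/200)         # time to death without CR
tR   <- tCR + rexp(n, 1/400)   # relapse, measured from CR (conditional on T_CR)
tNRD <- tCR + rexp(n, 1/800)   # non-relapse death, measured from CR
tPRD <- tR + rexp(n, 1/150)    # post-relapse death, measured from relapse
stateAt <- function(t) {
    cr  <- tCR < tNCD                     # CR achieved before death
    rel <- cr & tR < pmin(tNRD, t)        # relapsed before t
    c(AI  = mean(t < pmin(tCR, tNCD)),
      NCM = mean(tNCD < pmin(tCR, t)),
      ACR = mean(cr & t < pmin(tR, tNRD)),
      NRM = mean(cr & tNRD < pmin(tR, t)),
      AR  = mean(rel & t < tPRD),
      PRM = mean(rel & tPRD < t))
}
p <- stateAt(365)
sum(p)  # the six orderings partition the sample space, so this sums to 1
```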
To estimate the population average transition probabilities and absolute incidence of each individual stage we use the msSurv R package (Ferguson, Datta, and Brock 2012). The resulting time-dependent joint distribution \(P(X_t)\) is shown in Figure 1B.
To estimate how each transition \(T\) depends on the set of variables \(Z\) introduced in section variables, we use a random effects model for each transition to obtain \(f(T\mid Z)\). Competing events are considered to be censored. We apply a separate random effects model for each of the five terms in Eq.\(\eqref{eq:joint-density}\).
The estimation of the unconditional densities \(f(T=t\mid Z)\) is straightforward. The random effects model yields the marginal survivor function \(S(t \mid Z) = S_0(t)^{\exp(\beta Z)}\), quantifying the hypothetical scenario that there were no competing events \(T'\). From \(S(t\mid Z)\) we can derive the marginal densities \(f(t \mid Z) = -dS(t\mid Z)/dt = -\exp(\beta Z)\, S_0(t)^{\exp(\beta Z) -1}\, dS_0(t)/dt\) for each transition \(T\) given the covariates \(Z\). The Kaplan-Meier estimate of \(S_0(t)\) is a step function, so we compute \(f(t\mid Z)\) by numerical differentiation.
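As a minimal illustration of this differentiation (with an assumed smooth baseline and assumed linear predictor, not fitted values), differencing the exponentiated survivor function on a daily grid gives the density:

```r
## Numerical differentiation of S(t|Z) = S0(t)^exp(beta Z); inputs are assumed
S0 <- exp(-0.002 * 0:1000)   # hypothetical baseline survivor function, daily grid
lp <- 0.7                    # hypothetical linear predictor beta Z
S  <- S0 ^ exp(lp)           # marginal survivor function given Z
f  <- -diff(S)               # f(t|Z) on daily intervals; non-negative by construction
sum(f)                       # total mass so far, equals 1 - S at the last grid point
```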
To estimate the conditional densities of the type \(f(U=u\mid Z, T)\), we use the following approach:
\[S(u \mid Z, T=t) = S_0(u-t \mid Z)^{\exp(g(t))} = S_0(u-t)^{\exp(\beta Z + g(t))}. \label{eq:cond-dens}\]
This allows us to estimate the incidence of each event from the beginning of each stage, \(S_0(u-t \mid Z)\), and express the time-dependence as a smooth function \(g(t)\). For example, the duration of CR1 is a prognostic factor for post-relapse mortality (Burnett et al. 2013).
The above corresponds to a Cox proportional hazards model with a time-dependent smooth covariate \(g(t)\).
Here we estimate \(g(t)\) with a spline term with 10 degrees of freedom. We estimate \(S_0(u-t)\) and \(\beta\) using a random effects model and subsequently estimate \(g(t)\), using the fitted linear predictor \(\beta Z\) as a covariate:
fit_u_minus_t_given_Z <- CoxRFX(Z, Surv(U - T, event))
beta <- coef(fit_u_minus_t_given_Z)
fit_u_given_Z_t <- coxph(Surv(U - T, event) ~ I(Z %*% beta) + pspline(T, df=10))
We may thus obtain \(S(u\mid T=t, Z) = \left(S_0(u-t)^{\exp(\beta Z)}\right)^{\exp(g(t))}\) by shifting the baseline hazard to start at \(t\) and exponentiating twice, once for the covariate effect \(\beta Z\) and once for the time-dependence \(g(t)\).
The absolute probability to be in state \(U\) is given by integrating over the conditional probabilities \(S(u\mid T=t, Z)\), weighted by the probability of the preceding event probabilities \(f(T=t\mid Z)\):
\[P(U < u \mid Z) = \int_0^u f(T=t \mid Z) \int_t^u f(U=v \mid T=t, Z)\, dv\, dt = \int_0^u f(T=t \mid Z)\, F(u \mid T=t, Z)\, dt \label{eq:cond-prob}\]
With \(F(u \mid T=t, Z) = 1-S(u \mid T=t, Z)\), the integral can be evaluated numerically.
The pseudo code for this is given by:
St_given_Z <- St0 ^ exp(Z %*% beta_t)   # survivor function of the preceding event T
S0_given_Z <- S0 ^ exp(Z %*% beta)      # survivor function of U, measured from T
ft <- -diff(St_given_Z)                 # density f(T=t | Z) on a daily grid
gt <- predict(fit_u_given_Z_t, newdata=data.frame(T=seq_along(ft)))  # g(t)
Pu <- numeric(length(S0_given_Z))       # P(U < u | Z), Eq. (cond-prob)
for(t in seq_along(ft))                 # accumulate f(T=t) * F(u | T=t, Z)
    Pu[-(1:t)] <- Pu[-(1:t)] + ft[t] * (1 - S0_given_Z[1:(length(Pu)-t)] ^ exp(gt[t]))
In cases of competing events (CR and NCD; NRD and CIR), we use a competing risk adjustment between two event times \(T\), \(U\), to obtain \[P(T < t, T < U \mid Z) = \int_0^t \int_{t'}^\infty f(T=t'\mid Z)\, f(U=u' \mid Z)\, du'\, dt' = \int_0^t f(T=t' \mid Z)\, S(U=t'\mid Z)\, dt'.\]
In practical terms, \(S(t\mid Z) = S_0(t)^{\exp(\beta Z)}\) denotes the survivor function estimated by the Kaplan-Meier estimate \(S_0(t)\), exponentiated by the relative hazard \(\exp(\beta Z)\). The differential \(f(t\mid Z)\) is obtained by evaluating the difference \(S(t+1\mid Z) - S(t\mid Z)\) at intervals of one day; in pseudo code:
S_t_cr <- cumsum(c(1, diff(S_t)) * S_u)  # competing-risk adjusted survivor function
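With constant hazards the adjustment has a closed form, which gives a quick check of the discretised integral (illustrative rates, not fitted values): for \(T \sim \text{Exp}(a)\) and \(U \sim \text{Exp}(b)\), \(P(T < t, T < U) = \frac{a}{a+b}(1 - e^{-(a+b)t})\).

```r
## Check the daily-grid competing risk adjustment against the closed form;
## a and b are hypothetical constant hazards
a <- 0.01; b <- 0.005
tt  <- 0:2000                              # daily grid
S_t <- exp(-a * tt); S_u <- exp(-b * tt)
S_t_cr <- cumsum(c(1, diff(S_t)) * S_u)    # adjusted curve, as in the pseudo code
exact  <- 1 - a/(a + b) * (1 - exp(-(a + b) * tt))
max(abs(S_t_cr - exact))                   # small discretisation error
```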
| Endpoint | Censored | Model | Competing | Interval | Time-dependency |
|---|---|---|---|---|---|
| Complete remission | Non-remission death | CoxRFX | Non-remission death | From ER | |
| Non-remission death | Complete remission | CoxRFX | Complete remission | From ER | |
| Relapse | Non-relapse death | CoxRFX | Non-relapse death | From CR1 | Time to CR1 |
| Non-relapse death | Relapse | CoxRFX | Relapse | From CR1 | Time to CR1 |
| Post-relapse death | Last follow up | CoxRFX | - | From relapse | Duration of CR1 |
As the density \(\eqref{eq:joint-density}\) factorises we can successively evaluate each term, beginning with the first transition. The probability to be in a given state is then computed according to the rules outlined in the previous two subsections.
The probability is given by a simple competing risk adjustment between \(T_{CR}\) and \(T_{NCD}\):
\[ P(X_t = NCM \mid Z) = P(T_{NCD} < t, T_{NCD} < T_{CR} \mid Z) = \int_0^t f(T_{NCD} = u \mid Z)\, S(T_{CR} = u\mid Z)\, du.\]
We first compute the probability that CR is achieved, irrespective of the subsequent events, using a competing risk adjustment with \(T_{NCD}\):
\[P(X_t = CR \mid Z) = P(T_{CR} < T_{NCD}, T_{CR} < t\mid Z) = \int_0^t f(T_{CR} = u \mid Z)\, S(T_{NCD} = u\mid Z)\, du ,\]
where \(CR = \{ACR \cup NRM \cup AAR \cup PRM\}\), which can then be further subdivided according to the possible subsequent events. To be alive in CR, neither relapse nor non-relapse death may have occurred.
The probability to be alive in induction is given by \[P(X_t = AI\mid Z) = 1-P(X_t = CR\mid Z) - P(X_t=NCM\mid Z).\]
The probability of non-relapse deaths \(P(X_t = NRM \mid Z)\) is computed in the following way. We first estimate transition rates for non-relapse deaths and relapses, \(f(T_{NRD} \mid Z, T_{CR})\)
and \(f(T_{R} \mid Z, T_{CR})\), as outlined in section conditional-densities. Instead of the marginal density \(f(T_{CR} \mid Z)\) we use the differential \(dP(X_t = CR \mid Z) / dt\) in Eq.\(\eqref{eq:cond-prob}\).
As only one of the two events can ever occur, we then use a competing risk adjustment to obtain the absolute probabilities \(P(X_t = NRM \mid Z)\) and \(P(X_t = R \mid Z)\), respectively, where \(R = \{ AAR \cup PRM \}\) denotes a relapse.
The probability to be alive in first complete remission equals the probability of having achieved CR and neither having died nor relapsed since: \[P(X_t=ACR \mid Z) = P(X_t = CR \mid Z) - P(X_t = NRM \mid Z) - P(X_t = R \mid Z). \]
The probability of post-relapse deaths \(P(X_t = PRM \mid Z)\) is computed as described in section conditional-densities. We first estimate the rate of post-relapse deaths, \(f(T_{PRD} \mid Z, T_{R})\), and use the differential \(dP(X_t = R \mid Z) / dt\) in Eq.\(\eqref{eq:cond-prob}\).
Finally, the probability to be alive after relapse is given by \[ P(X_t = AAR \mid Z) = P(R\mid Z) - P(PRM\mid Z).\]
For each predicted variable we can derive 95% confidence intervals from the prediction error of the log hazard, \((h_{0.025},h_{0.975})\approx h + (-2,2) \times \sqrt{\hat V[h \mid Z]}\), where \(\hat V[h\mid Z]\) denotes the estimated variance of the log hazard. This translates to the survivor function via the log-log approach: \[S_{0.025}(t \mid Z) = S_0(t)^{\exp(h_{0.025})}\] \[S_{0.975}(t \mid Z) = S_0(t)^{\exp(h_{0.975})}.\]
Note that this does not model the error of the baseline survival estimate \(S_0(t)\).
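A minimal sketch of these interval bounds, with an assumed baseline curve and assumed prediction variance (hypothetical inputs, not model output):

```r
## Log-log confidence band for the survivor function; all inputs hypothetical
S0 <- exp(-0.001 * 0:1825)           # assumed baseline survivor function
h  <- 0.4                            # assumed predicted log hazard
Vh <- 0.09                           # assumed prediction variance V[h|Z]
S_hat <- S0 ^ exp(h)
S_lo  <- S0 ^ exp(h + 2 * sqrt(Vh))  # lower survival bound: higher hazard
S_hi  <- S0 ^ exp(h - 2 * sqrt(Vh))  # upper survival bound: lower hazard
```

Note that the hazard perturbation acts multiplicatively on the cumulative hazard, so the band is widest at intermediate survival probabilities and pinched at \(S=1\).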
Let the symbol PCS denote post remission survival. In the following sections all quantities are conditional on the data \(Z\).
Analytical confidence intervals can be calculated using propagation of errors based on a Taylor expansion of the PCS probability: \[\begin{align} V[h_{PCS}] &\approx \sum_i \left(\frac{\partial h_{PCS}}{\partial h_i}\right)^2 V[h_i] \\ h_{PCS} &= \log\log P_{PCS} + \log\log P_0(t)\\ \frac{\partial h_{PCS}}{\partial h_i} &= \frac{\partial \log\log P_{PCS}}{\partial h_i} \\ &= \frac{1}{P_{PCS}\log(P_{PCS})} \frac{\partial P_{PCS}}{\partial h_i} \end{align} \]
To facilitate an efficient computation of the derivatives of \(P_{PCS}\), which is given by the integrals above, we use the pointwise approximation: \[\begin{align} P_{PCS} &\approx S_{NRD}(1 - (1-S_{R})(1- S_{PRD})) = S_{PCS}\\ \end{align}\] where \(S_\cdot\) denote the Kaplan-Meier estimates of the survival probabilities.
The partial derivative of the loglog is given by \[\frac{\partial \log\log S}{\partial x} = \frac{1}{\log(S)} \frac{1}{S} \frac{\partial S}{\partial x}.\]
So the variance of the loglog overall survival reads: \[V[h_{PCS}]\approx \frac{1}{(S_{PCS}\log S_{PCS})^2} \left( V[S_{NRD}](1-(1-S_{R})(1- S_{PRD}))^2 + V[S_{R}] S_{NRD}^2 (1-S_{PRD})^2 + V[S_{PRD}] S_{NRD}^2 (1-S_{R})^2 \right )\] with \[V[S_i] = \left(\frac{\partial S_0(t)^{\exp(h)}}{\partial h}\right)^2 V[h] = (S_i \log S_i)^2 V[h]. \]
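A worked example of the propagated variance with illustrative survival probabilities and log-hazard variances (assumed numbers, not model output):

```r
## Delta-method variance of the log-log post-remission survival
S  <- c(NRD=0.8, R=0.6, PRD=0.3)      # assumed pointwise survival estimates
Vh <- c(NRD=0.02, R=0.03, PRD=0.05)   # assumed variances of the log hazards
VS <- (S * log(S))^2 * Vh             # V[S_i] = (S_i log S_i)^2 V[h_i]
Spcs <- unname(S["NRD"] * (1 - (1 - S["R"]) * (1 - S["PRD"])))
Vpcs <- unname(VS["NRD"] * (1 - (1 - S["R"]) * (1 - S["PRD"]))^2 +
        VS["R"]   * (S["NRD"] * (1 - S["PRD"]))^2 +
        VS["PRD"] * (S["NRD"] * (1 - S["R"]))^2) / (Spcs * log(Spcs))^2
c(S_PCS=Spcs, V_hPCS=Vpcs)
```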
This allows us to define 95% confidence intervals of \(h_{PCS}\) as: \[h_{PCS}^{0.975} = h_{PCS} + 2\sqrt{V[h_{PCS}]}.\]
This translates to an overall survival: \[P_{PCS}^{0.975} = \exp \exp(h_{PCS}^{0.975}) = \exp \left(\exp(h_{PCS}) \exp( 2\sqrt{V[h_{PCS}]}) \right) = P_{PCS} ^{\exp(2 \sqrt{V[h_{PCS}]})}\]
Note that the last step uses the competing risk and time-adjusted estimate \(P_{PCS}\) again.
A more accurate account comes from simulating the errors in the predicted log hazard. The cumulative survival functions are given by \[S_{\cdot}(t) = S_{0\cdot}(t)^{\exp(h_\cdot + \epsilon_\cdot)}.\] Drawing \[\epsilon_\cdot \sim N(0,\hat V[h_\cdot])\] for each event type and repeating the computations outlined in [combined-os] yields an empirical distribution of \(S_{PCS}(t)\).
We use 200 simulations to compute the empirical confidence intervals.
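The simulation can be sketched as follows, with an assumed baseline curve and assumed prediction variance (hypothetical inputs):

```r
## Simulation-based pointwise confidence band; all inputs hypothetical
set.seed(42)
S0 <- exp(-0.0015 * 0:1825)                 # assumed baseline survivor function
h  <- 0.2; Vh <- 0.04                       # assumed log hazard and its variance
eps  <- rnorm(200, 0, sqrt(Vh))             # 200 simulations, as in the text
Ssim <- sapply(h + eps, function(x) S0 ^ exp(x))         # times x simulations
ci <- apply(Ssim, 1, quantile, probs=c(0.025, 0.975))    # empirical 95% band
```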
Note: In all cases the prediction errors are assumed to be independent.
Confidence intervals for differential survival, e.g. with and without allograft, are computed as in the previous section. A complication arises as the errors are correlated. We hence sample errors for all common variables jointly and then sample separately for those variables that differ. This approach allows us to assess the uncertainty resulting from a subset of variables against the background of the joint variation in the set of common features.
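The sampling scheme can be sketched as follows (assumed variance split between the shared and the differing covariates; the scenarios and values are hypothetical):

```r
## Correlated errors: shared covariates get one common draw per simulation,
## covariates that differ between scenarios get independent draws
set.seed(7)
Vcommon <- 0.03; Vdiff <- 0.01                # hypothetical variance components
epsCommon <- rnorm(200, 0, sqrt(Vcommon))     # shared across both scenarios
h1 <- 0.2 + epsCommon + rnorm(200, 0, sqrt(Vdiff))  # e.g. with allograft
h2 <- 0.5 + epsCommon + rnorm(200, 0, sqrt(Vdiff))  # e.g. without allograft
var(h2 - h1)  # ~ 2 * Vdiff: the common error cancels in the difference
```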
We use a numerical approach similar to the one outlined above to compute the confidence intervals for overall survival measured from diagnosis. Note that it is in principle also possible to derive analytical confidence intervals analogous to section 5.4.2.1.
Multi-state using msSurv (Ferguson, Datta, and Brock 2012).
library(msSurv)
d <- sapply(1:nrow(clinicalData), function(i){
i <<- i
t <- c(as.numeric(clinicalData[i,c("CR_date","Recurrence_date","Date_LF")]) - as.numeric(clinicalData$ERDate[i]))
o <- order(t, na.last=NA)
stages <- c(1:3,0)
r <- stages[c(1, o+1)]
if(clinicalData$Status[i])
r[length(r)] <- r[length(r)-1] +3
tt <- c(0,t[o])
if(length(o)==0)
return(c(rep(NA,7),i))
s <- cbind(id=i, stop=tt[-1], start.stage=r[-length(r)], end.stage=r[-1])[diff(tt)!=0,]
#s <- cbind(time1 = tt[-length(tt)], time2=tt[-1], death=c(rep(0, length(o)-1), clinicalData$Status[i]), outer(0:(length(o)-1), r[-3], `>=`)+0, i=i)[diff(tt)!=0,]
return(s)
})
d <- as.data.frame(do.call("rbind",d))
nodes <- as.character(1:6)
edges <- list(`1`=list(edges=c("2","4")), `2`=list(edges=c("3","5")), `3`=list(edges="6"), `4`=list(edges=NULL), `5`=list(edges=NULL),`6`=list(edges=NULL))
struct <- new("graphNEL", nodes = nodes, edgeL = edges, edgemode = "directed")
msurv <- msSurv(d, struct, bs = FALSE)
##
## Entry distributions calculated for states 2 3 4 5 6 .
##
## Exit distributions calculated for states 1 2 3 .
y <- t(apply(cbind(1,-msurv@ps[,c(4:6, 3:1)]),1,cumsum))
par(mar=c(3,3,1,1), bty="n", mgp=c(2,.5,0), las=1)
plot(msurv@et/365.25, y[,1], ylim=c(0,1), type="s",lty=0, xlab="Time after diagnosis", ylab="Fraction of patients", xlim=c(0,10), xaxs="i", yaxs="i")
steps <- function(x, type="s") rep(x, each=2)[if(type=="s") -1 else -2*length(x)]
x <- steps(msurv@et/365.25, type="S")
for(i in 1:6)
polygon(c(x, rev(x)), c(steps(y[,i]), rev(steps(y[,i+1])) ), col=c(brewer.pal(5,"Pastel1")[c(1:3,5,4)],"#DDDDDD")[i], border=NA)
abline(h=seq(0,1,.2), col='white', lty=3)
abline(v=seq(0,10,1), col='white', lty=3)
lines(x, steps(y[,4]), lwd=2)
w <- which.min(abs(msurv@et/365.25-10))
text(x=par("usr")[2], y= y[w,-7]+diff(y[w,])/2, labels=c("early death","death in CR","death after relapse","alive with relapse","alive in remission","induction/LOF"), pos=2)
Times for allografts pre and post relapse, after 1CR only
alloIdx <- clinicalData$TPL_type %in% c("ALLO","FREMD") # only allografts
alloTimeCR1 <- clinicalData$Time_1CR_TPL + .5 # +.5 to make > 0
alloTimeCR1[!alloIdx | (clinicalData$TPL_date < clinicalData$Recurrence_date & !clinicalData$TPL_Phase %in% c("CR1","RD"))] <- NA
Create data frames for each phase
whichRFXRel <- whichRFXOsTDGG[grep("TPL",names(whichRFXOsTDGG), invert=TRUE)] #mainIdx & !grepl("TPL", names(dataFrame)) & groups!="Nuisance"
t <- clinicalData$Recurrence_date
t[is.na(t)] <- as.Date(1e6, origin="2000-01-01")
relData <- MakeTimeDependent(dataFrame[whichRFXRel], timeEvent=alloTimeCR1, timeStop=as.numeric(pmin(t, clinicalData$Date_LF) - clinicalData$CR_date), status=!is.na(clinicalData$Recurrence_date)+0)
relData$transplantCR1 <- relData$event
relData$event <- NULL
relData$transplantRel <- 0
nrdData <- MakeTimeDependent(dataFrame[whichRFXRel], timeEvent=alloTimeCR1, timeStop=as.numeric(pmin(t, clinicalData$Date_LF) - clinicalData$CR_date), status=is.na(clinicalData$Recurrence_date) & clinicalData$Status)
nrdData$transplantCR1 <- nrdData$event
nrdData$event <- NULL
nrdData$transplantRel <- 0
alloTimeRel <- clinicalData$TPL_date - clinicalData$Recurrence_date + .5 # +.5 to make > 0
alloTimeRel[!alloIdx | (clinicalData$TPL_date < clinicalData$Recurrence_date & !clinicalData$TPL_Phase %in% c("CR1","RD"))] <- NA
i <- !is.na(clinicalData$Recurrence_date)
prdData <- MakeTimeDependent(dataFrame[i,whichRFXRel], timeEvent=alloTimeRel[i], timeStop=as.numeric(clinicalData$Date_LF- clinicalData$Recurrence_date)[i], status=clinicalData$Status[i])
prdData$transplantCR1 <- rep(0,nrow(prdData))
w <- sub("\\.1","",rownames(relData))[relData$status==1 & relData$transplantCR1==1]
prdData$transplantCR1[sub("\\.1","",rownames(prdData)) %in% w] <- 1
prdData$transplantRel <- prdData$event
prdData$event <- NULL
w <- which(prdData$time1 == prdData$time2) ## 5 cases with LF=Rec
prdData$time2[w] <- prdData$time2[w] + .5
prdData$time0 <- as.numeric(clinicalData$Recurrence_date-clinicalData$CR_date)[prdData$index]
crGroups <- c(as.character(groups[whichRFXRel]), "Treatment","Treatment")
names(crGroups) <- c(names(dataFrame)[whichRFXRel],"transplantCR1","transplantRel")
coxRFXNrdTD <- CoxRFX(nrdData[names(crGroups)], Surv(nrdData$time1, nrdData$time2, nrdData$status), groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
coxRFXNrdTD$coefficients["transplantRel"] <- 0
#prsData$time1[!is.na(prsData$time1)] <- 0
coxRFXPrdTD <- CoxRFX(prdData[names(crGroups)], Surv(prdData$time1, prdData$time2, prdData$status), groups=crGroups, nu=1, which.mu = intersect(mainGroups, unique(crGroups)))
coxRFXRelTD <- CoxRFX(relData[names(crGroups)], Surv(relData$time1, relData$time2, relData$status), groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
coxRFXRelTD$coefficients["transplantRel"] <- 0
osData <- MakeTimeDependent(dataFrame[whichRFXRel], timeEvent=alloTimeCR1, timeStop=as.numeric(clinicalData$Date_LF- clinicalData$CR_date), status=clinicalData$Status)
osData$transplantCR1 <- osData$event
osData$transplantRel <- osData$event
w <- which(clinicalData$TPL_date > clinicalData$Recurrence_date | clinicalData$TPL_Phase != "CR1")
osData$transplantCR1[osData$index %in% w] <- 0
osData$transplantRel[!osData$index %in% w] <- 0
data <- osData[rev(!duplicated(rev(osData$index))),colnames(coxRFXRelTD$Z)]
osData$transplantRel <- 0 # Note: confounded by relapse
rownames(data) <- sub("\\.1$","", rownames(data))
data <- data[rownames(dataFrame),]
coxRFXOsCR <- CoxRFX(osData[names(crGroups)], Surv(osData$time1, osData$time2, osData$status), groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
table(CR=!is.na(clinicalData$CR_date), os[,2])
##
## CR 0 1
## FALSE 7 264
## TRUE 629 640
c <- as.numeric(clinicalData$CR_date - clinicalData$ERDate)
c[is.na(c)] <- clinicalData$OS[is.na(c)]
cr <- Surv(c, factor(pmin(2 * (!is.na(clinicalData$CR_date))+os[,2],2), levels=0:2, labels=c("cens","ED","CR")), type="mstate")
coxRFXCrTD <- CoxRFX(osData[1:1540, names(crGroups)], Surv(cr[,1], cr[,2]==2), groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
coxRFXNcdTD <- CoxRFX(osData[1:1540, names(crGroups)], Surv(cr[,1], cr[,2]==1), groups=crGroups, which.mu = NULL)
par(mfrow=c(3,2), xpd=FALSE)
o <- c(1,4,6,5,2,3,7,8)
PlotVarianceComponents(coxRFXNcdTD, col=colGroups, order=o)
title(main="Early deaths")
PlotVarianceComponents(coxRFXCrTD, col=colGroups, order=o)
title(main="Remission")
PlotVarianceComponents(coxRFXRelTD, col=colGroups, order=o)
title(main="Relapse")
PlotVarianceComponents(coxRFXNrdTD, col=colGroups, order=o)
title(main="Non-relapse deaths")
PlotVarianceComponents(coxRFXPrdTD, col=colGroups, order=o)
title(main="Post-relapse deaths")
As a barplot
par(mar=c(4,3,1,5))
allVarComp <- sapply(c("NcdTD","CrTD","NrdTD","RelTD","PrdTD"), function(x){
m <- get(paste0("coxRFX",x))
Z <- get(sub("\\[.+","",as.character(m$call["data"])))
i <- if(x%in%c("CrTD","EsTD")) 1:1540 else Z$index
VarianceComponents(m, newZ=Z[!rev(duplicated(rev(i))),colnames(m$Z)])})
colnames(allVarComp) <- c("Early deaths","Remission","Non-relapse d.","Relapse","Post-relapse d.")
w <- c("CNA","Fusions","Genetics","GeneGene","Clinical","Demographics","Treatment","Nuisance")
z <- allVarComp[w,]#/rep(colSums(allVarComp[-9,]), each=8)
b <- barplot(z, col=colGroups[w], ylab="Variance [log hazard]", names.arg=rep("",ncol(z)))
rotatedLabel(x0=b, labels=colnames(z))
Z <- rbind(0,apply(z,2,cumsum))
n <- ncol(z)
segments(b[-n]+.5,t(Z[,-n]),b[-1]-.5 ,t(Z[,-1]))
z <- allVarComp[w,]/rep(colSums(allVarComp[-9,]), each=8)
b <- barplot(z, col=colGroups[w], ylab="Relative importance", names.arg=rep("",ncol(z)))
rotatedLabel(x0=b, labels=colnames(z))
Z <- rbind(0,apply(z,2,cumsum))
n <- ncol(z)
segments(b[-n]+.5,t(Z[,-n]),b[-1]-.5 ,t(Z[,-1]))
mtext(side=4, at=Z[-1,n] - diff(Z[,n])/2, text=rownames(Z)[-1], las=2)
v <- c(1,3,5,4,2)
z <- allVarComp[w,v]/rep(colSums(allVarComp[-9,v]), each=8)
b <- barplot(z, col=colGroups[w], ylab="Relative importance", names.arg=rep("",ncol(z)))
rotatedLabel(x0=b, labels=colnames(z))
Z <- rbind(0,apply(z,2,cumsum))
n <- ncol(z)
segments(b[-n]+.5,t(Z[,-n]),b[-1]-.5 ,t(Z[,-1]))
mtext(side=4, at=Z[-1,n] - diff(Z[,n])/2, text=rownames(Z)[-1], las=2)
Pairwise scatter plots of the log hazard for each transition
allStagesRisk <- as.data.frame(sapply(c("NcdTD","CrTD","NrdTD","RelTD","PrdTD"), function(x){
m <- get(paste0("coxRFX",x))
#Z <- get(sub("\\[.+","",as.character(m$call["data"])))
#i <- if(x=="Cr") 1:1540 else Z$index
Z <- if(x=="Cr") dataFrame else data[rownames(dataFrame),]
predict(m, newdata=as.data.frame(Z))}))
f <- function(x,y,...) {points(x,y, col=densCols(x,y),...); lines(lowess(x,y), col='red')}
pairs(allStagesRisk, panel=f, pch=19)
Non-complete remission deaths
w <- WaldTest(coxRFXNcdTD)
w$Q.BH <- p.adjust(w$p.value, "BH")
w$Q.BY <- p.adjust(w$p.value, "BY")
datatable(w)
sheet <- createSheet(wb, sheetName="Non-complete remission deaths")
addDataFrame(w,
sheet,
colnamesStyle = CellStyle(wb) + Font(wb, isBold=TRUE) + Border(),
rownamesStyle = CellStyle(wb) + Font(wb, isBold=TRUE)
)
Complete remission
w <- WaldTest(coxRFXCrTD)
w$Q.BH <- p.adjust(w$p.value, "BH")
w$Q.BY <- p.adjust(w$p.value, "BY")
datatable(w)
sheet <- createSheet(wb, sheetName="Complete remission")
addDataFrame(w,
sheet,
colnamesStyle = CellStyle(wb) + Font(wb, isBold=TRUE) + Border(),
rownamesStyle = CellStyle(wb) + Font(wb, isBold=TRUE)
)
Relapses
w <- WaldTest(coxRFXRelTD)
w$Q.BH <- p.adjust(w$p.value, "BH")
w$Q.BY <- p.adjust(w$p.value, "BY")
datatable(w)
sheet <- createSheet(wb, sheetName="Relapse")
addDataFrame(w,
sheet,
colnamesStyle = CellStyle(wb) + Font(wb, isBold=TRUE) + Border(),
rownamesStyle = CellStyle(wb) + Font(wb, isBold=TRUE)
)
Post-relapse survival
w <- WaldTest(coxRFXPrdTD)
w$Q.BH <- p.adjust(w$p.value, "BH")
w$Q.BY <- p.adjust(w$p.value, "BY")
datatable(w)
sheet <- createSheet(wb, sheetName="Post-relapse deaths")
addDataFrame(w,
sheet,
colnamesStyle = CellStyle(wb) + Font(wb, isBold=TRUE) + Border(),
rownamesStyle = CellStyle(wb) + Font(wb, isBold=TRUE)
)
Non-relapse deaths
w <- WaldTest(coxRFXNrdTD)
w$Q.BH <- p.adjust(w$p.value, "BH")
w$Q.BY <- p.adjust(w$p.value, "BY")
datatable(w)
sheet <- createSheet(wb, sheetName="Non-relapse deaths")
addDataFrame(w,
sheet,
colnamesStyle = CellStyle(wb) + Font(wb, isBold=TRUE) + Border(),
rownamesStyle = CellStyle(wb) + Font(wb, isBold=TRUE)
)
saveWorkbook(wb, file="SupplementaryTables.xlsx")
The following function fits a 5-stage model. Note that we use a single smooth function \(g(t)\) to model the association between the time of CR and all subsequent events. The core convolution is implemented in C++ for efficiency using the Rcpp package (Eddelbuettel and Francois 2011).
library(Rcpp)
MultiRFX5 <- function(coxRFXNcdTD, coxRFXCrTD, coxRFXNrdTD, coxRFXRelTD, coxRFXPrdTD, data, x =365, tdPrmBaseline = rep(1, ceiling(max(x))+1), tdOsBaseline = rep(1, ceiling(max(x))+1), ciType="analytical"){
cppFunction('NumericVector computeHierarchicalSurvival(NumericVector x, NumericVector diffS0, NumericVector S1Static, NumericVector haz1TimeDep) {
int xLen = x.size();
double h;
NumericVector overallSurvival(xLen);
for(int i = 0; i < xLen; ++i) overallSurvival[i] = 1;
for(int j = 1; j < xLen; ++j){
h = haz1TimeDep[j-1];
for(int i = j; i < xLen; ++i){
overallSurvival[i] += diffS0[j-1] * (1-pow(S1Static[i-j], h));
}
}
return overallSurvival;
}')
## Step 1: Compute KM survival curves and log hazard
getS <- function(coxRFX, data, max.x=5000) {
if(!is.null(coxRFX$na.action)) coxRFX$Z <- coxRFX$Z[-coxRFX$na.action,]
data <- as.matrix(data[,match(colnames(coxRFX$Z),colnames(data)), drop=FALSE])
r <- PredictRiskMissing(coxRFX, data, var="var2")
H0 <- basehaz(coxRFX, centered = FALSE)
hazardDist <- splinefun(H0$time, H0$hazard, method="monoH.FC")
x <- c(0:ceiling(max.x))
S <- exp(-hazardDist(x))
return(list(S=S, r=r, x=x, hazardDist=hazardDist, r0 = coxRFX$means %*% coef(coxRFX)))
}
kmCr <- getS(coxRFX = coxRFXCrTD, data = data, max.x=max(x))
kmEs <- getS(coxRFX = coxRFXNcdTD, data = data, max.x=max(x))
kmCir <- getS(coxRFX = coxRFXRelTD, data = data, max.x=max(x))
kmNrm <- getS(coxRFX = coxRFXNrdTD, data = data, max.x=max(x))
kmPrs <- getS(coxRFX = coxRFXPrdTD, data = data, max.x=max(x))
xx <- 0:ceiling(max(x))
sapply(1:nrow(data), function(i){
## Step 2: Adjust curves for competing risks, accounting for hazard
crAbs <- cumsum(c(1,diff(kmCr$S^exp(kmCr$r[i,1]))) * kmEs$S ^ exp(kmEs$r[i,1]))
esAbs <- cumsum(c(1,diff(kmEs$S^exp(kmEs$r[i,1]))) * kmCr$S ^ exp(kmCr$r[i,1])) ## array times x nrow(data)
cirCrAbs <- cumsum(c(1,diff(kmCir$S^exp(kmCir$r[i,1]))) * kmNrm$S ^ exp(kmNrm$r[i,1]))
nrsCrAbs <- cumsum(c(1,diff(kmNrm$S^exp(kmNrm$r[i,1]))) * kmCir$S ^ exp(kmCir$r[i,1])) ## array times x nrow(data)
## Step 3: Compute hierarchical survival
### Prs
rsCrAbs <- computeHierarchicalSurvival(x = xx, diffS0 = diff(cirCrAbs), S1Static = kmPrs$S, haz1TimeDep = tdPrmBaseline * exp(kmPrs$r[i,1]))
## Confidence intervals (loglog)
PlogP2 <- function(x) {(x * log(x))^2}
errOs <- kmNrm$r[i,2] * PlogP2(kmNrm$S^exp(kmNrm$r[i,1])) * (1-(1-kmCir$S ^ exp(kmCir$r[i,1]))) * (1-kmPrs$S ^ exp(kmPrs$r[i,1]))^2 + kmCir$r[i,2] * PlogP2(kmCir$S ^ exp(kmCir$r[i,1])) * (1-kmPrs$S ^ exp(kmPrs$r[i,1]))^2 * (kmNrm$S ^ exp(kmNrm$r[i,1]))^2 + kmPrs$r[i,2] * PlogP2(kmPrs$S ^ exp(kmPrs$r[i,1])) * (1-kmCir$S ^ exp(kmCir$r[i,1]))^2 * (kmNrm$S ^ exp(kmNrm$r[i,1]))^2
sdOsCr <- sqrt(errOs / PlogP2(1-(1-nrsCrAbs)-(1-rsCrAbs)))
### Overall survival from enrollment
nrsEr <- computeHierarchicalSurvival(x = xx, diffS0 = diff(crAbs), S1Static = nrsCrAbs, haz1TimeDep = tdOsBaseline)
rsEr <- computeHierarchicalSurvival(x = xx, diffS0 = diff(crAbs), S1Static = rsCrAbs, haz1TimeDep = tdOsBaseline)
cirEr <- computeHierarchicalSurvival(x = xx, diffS0 = diff(crAbs), S1Static = cirCrAbs, haz1TimeDep = tdOsBaseline)
cbind(deathInErFromEr=1-esAbs, deathInCrFromEr=1-nrsEr, deathInRelFromEr=1-rsEr, aliveInRelFromEr=1-cirEr-(1-rsEr), aliveInCrFromEr=1-crAbs - (1-cirEr) - (1-nrsEr),
deathInCrFromCr = 1-nrsCrAbs, deathInRelapseFromCr=(1-rsCrAbs), aliveInRelapseFromCr = (1-cirCrAbs) - (1-rsCrAbs), osInCrFromCrSd=sdOsCr
)
}, simplify='array')
}
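For readability, here is a pure-R sketch of the convolution performed by the Rcpp routine above (the name computeHierarchicalSurvivalR is ours, for illustration): each daily increment diffS0[j] of the stage-0 curve seeds a copy of the stage-1 survival curve, shifted to start at day j and raised to the time-dependent relative hazard haz1TimeDep[j].

```r
## Pure-R reference for the C++ convolution (illustrative sketch only)
computeHierarchicalSurvivalR <- function(x, diffS0, S1Static, haz1TimeDep) {
    S <- rep(1, length(x))
    for (j in 2:length(x)) {
        i <- j:length(x)
        ## diffS0 is negative; each transition on day j-1 contributes the
        ## stage-1 mortality 1 - S1^h accrued over the i-j elapsed days
        S[i] <- S[i] + diffS0[j - 1] * (1 - S1Static[i - j + 1]^haz1TimeDep[j - 1])
    }
    S
}
## If everybody transitions on day 1 with relative hazard 1, the result is
## the stage-1 curve lagged by one day:
computeHierarchicalSurvivalR(0:3, c(-1, 0, 0), c(1, .8, .6, .4), rep(1, 3))
## [1] 1.0 1.0 0.8 0.6
```

The C++ version is preferred in practice because the double loop over days is quadratic in the follow-up length.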
PRS baseline with spline-based dependence on CR length
xmax <- 2000
xx <- 0:ceiling(xmax)
coxphPrs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(prdData, time0=as.numeric(clinicalData$Recurrence_date-clinicalData$CR_date)[prdData$index]))
tdPrmBaseline <- exp(predict(coxphPrs, newdata=data.frame(time0=xx[-1]))) ## Hazard (function of CR length)
coxphOs <- coxph(Surv(time1,time2, status)~ pspline(time0, df=10), data=data.frame(osData, time0=pmin(500,cr[osData$index,1])))
tdOsBaseline <- exp(predict(coxphOs, newdata=data.frame(time0=xx[-1]))) ## Hazard (function of induction length), only for OS (could do CIR, NRM, PRS separately)
fiveStagePredicted <- MultiRFX5(coxRFXNcdTD, coxRFXCrTD, coxRFXNrdTD, coxRFXRelTD, coxRFXPrdTD, data, tdPrmBaseline = tdPrmBaseline, tdOsBaseline = tdOsBaseline, x=xmax)
Function to plot stages
sedimentPlot <- function(Y, x=1:nrow(Y), y0=0, y1=NULL, col=1:ncol(Y), ...){
Z <- cbind(t(apply(cbind(y0,Y),1,cumsum)),y1)
plot(x,Z[,1], xlim=range(x), ylim=range(Z), lty=0, pch=NA,...)
for(i in 2:ncol(Z))
polygon(c(x,rev(x)), c(Z[,i-1],rev(Z[,i])), border=NA, col=col[i-1])
}
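sedimentPlot draws a stacked-area ("sediment") chart: each column of Y is added cumulatively on top of y0 and the strip between consecutive boundaries is filled with a polygon. The layering step, on toy numbers (illustration only):

```r
## Toy illustration of the cumulative layering used inside sedimentPlot
Y <- cbind(c(0.1, 0.2), c(0.3, 0.3))   # two stacked series, two time points
Z <- t(apply(cbind(0, Y), 1, cumsum))  # layer boundaries per time point
## Each row gives the stack boundaries 0 -> Y1 -> Y1+Y2:
Z[1, ]  # 0.0 0.1 0.4
Z[2, ]  # 0.0 0.2 0.5
```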
lineStage <- function(CR_date, Recurrence_date, Date_LF, ERDate, Status, y=0, col=1:5, pch.trans=19, pch.end=19, ...){
xpd <- par("xpd")
par(xpd=NA)
t <- as.numeric(c(CR_date, Recurrence_date, Date_LF) - ERDate )
w <- !is.na(t)
o <- order(t)
to <- pmin(t[o], par("usr")[2])
l <- length(to)
segments(c(0,to[-l]), rep(y,l), to, rep(y,l), col=col, lend=1, ...)
status <- if(Status == 1) 3 else 0
if(is.na(Recurrence_date))
status <- status - 1
if(is.na(CR_date))
status <- status - 1
x <- ifelse(t <= par("usr")[2], t, NA)
points(x, rep(y, length(t)), pch=c(pch.trans,pch.trans, if(Status) pch.end else NA), col=col[c(2:3,status+3)])
par(xpd=xpd)
}
Average of all multistage predictions; note the close agreement with the overall survival curve.
pastel1 <- brewer.pal(9, "Pastel1")
par(mfrow=c(1,1), mar=c(3,3,1,1), cex=1)
sedimentPlot(-rowMeans(fiveStagePredicted[,1:5,], dims=2), y0=1, y1=0, col=c(pastel1[c(1:3,5,4)], "#DDDDDD"))
lines(survfit(Surv(OS, Status) ~ 1, data=clinicalData))
Multistage predictions vs. overall survival
for(i in 1:5)
plot(summary(survfit(coxRFXFitOsTDGGc), i*365)$surv^ exp(coxRFXFitOsTDGGc$linear.predictors[1:1540]), 1-rowSums(aperm(fiveStagePredicted[,1:3,], c(3,1,2)), dim=2)[,i*365],
xlab="Survival RFX OS", ylab="Survival RFX Multistage", main=paste(i, "years"))
The following code is run on the cluster
read_chunk('../code/leaveOneOut.R', labels="leaveOneOut")
load("loo.RData")
library(mg14)
library(CoxHD)
library(Rcpp)
#save(dataFrame, nrdData, crGroups, mainGroups, prdData, relData, prdData, osData, cr, dataFrameOsTD, osTD, tplSplitOs, groups, data, whichRFXOsTDGG, clinicalData, MultiRFX5, file="../../code/loo.RData")
jobIndex <- as.numeric(Sys.getenv("LSB_JOBINDEX"))
cvIdx <- 1:nrow(dataFrame)
whichTrain <- which(cvIdx != jobIndex)
rfxNrs <- CoxRFX(nrdData[nrdData$index %in% whichTrain, names(crGroups)], Surv(nrdData$time1, nrdData$time2, nrdData$status)[nrdData$index %in% whichTrain], groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
rfxNrs$coefficients["transplantRel"] <- 0
#prsData$time1[!is.na(prsData$time1)] <- 0
rfxPrs <- CoxRFX(prdData[prdData$index %in% whichTrain, names(crGroups)], Surv(prdData$time1, prdData$time2, prdData$status)[prdData$index %in% whichTrain], groups=crGroups, nu=1, which.mu = intersect(mainGroups, unique(crGroups)))
rfxRel <- CoxRFX(relData[relData$index %in% whichTrain, names(crGroups)], Surv(relData$time1, relData$time2, relData$status)[relData$index %in% whichTrain], groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
rfxRel$coefficients["transplantRel"] <- 0
rfxCr <- CoxRFX(osData[whichTrain, names(crGroups)], Surv(cr[,1], cr[,2]==2)[whichTrain], groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
rfxEs <- CoxRFX(osData[whichTrain, names(crGroups)], Surv(cr[,1], cr[,2]==1)[whichTrain], groups=crGroups, which.mu = NULL)
ix <- tplSplitOs %in% whichTrain
rfxOs <- CoxRFX(dataFrameOsTD[ix,whichRFXOsTDGG], osTD[ix], groups[whichRFXOsTDGG], which.mu=mainGroups) ## allow only the main groups to have mean different from zero..
xx <- 0:2000
coxphPrs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(prdData, time0=as.numeric(clinicalData$Recurrence_date-clinicalData$CR_date)[prdData$index])[prdData$index %in% whichTrain,])
tdPrmBaseline <- exp(predict(coxphPrs, newdata=data.frame(time0=xx[-1])))
coxphOs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(osData, time0=pmin(500,cr[osData$index,1]))[osData$index %in% whichTrain,])
tdOsBaseline <- exp(pmin(predict(coxphOs, newdata=data.frame(time0=500)),predict(coxphOs, newdata=data.frame(time0=xx[-1])))) ## cap predictions at induction length 500 days.
multiRfx5 <- MultiRFX5(rfxEs, rfxCr, rfxNrs, rfxRel, rfxPrs, data[cvIdx == jobIndex,,drop=FALSE], tdPrmBaseline = tdPrmBaseline, tdOsBaseline = tdOsBaseline, x=2000)
save(rfxEs, rfxCr, rfxNrs, rfxPrs, rfxRel, rfxOs, multiRfx5, file=paste0("loo/",jobIndex,".RData"))
Multistage model
times <- round(seq(0,5,0.05)*365)
multiRfx5Loo <- sapply(mclapply(1:nrow(data), function(i){
e <- new.env()
t <- try(load(paste0("../code/loo/",i,".RData"), env=e))
if(class(t)=="try-error") rep(NA, length(times))
else e$multiRfx5[times+1,,1]
}, mc.cores=6), I, simplify="array")
OS prediction error
survConcordance(os ~ colSums(multiRfx5Loo[times == 3*365,1:3,]))
## Call:
## survConcordance(formula = os ~ colSums(multiRfx5Loo[times ==
## 3 * 365, 1:3, ]))
##
## n= 1540
## Concordance= 0.7237308 se= 0.01003199
## concordant discordant tied.risk tied.time std(c-d)
## 687786.00 262548.00 0.00 432.00 19067.49
ape(1-colSums(multiRfx5Loo[times == 3*365,1:3,]), os, 3*365)
## abs brier log2 bayes
## 0.3684970 0.1836208 0.7614532 0.2818182
We plot all predictions as sediment plots, laid out in the same way as the risk constellation plot (Supplementary Figure S2C).
set.seed(42)
nStars <- 32
s <- sample(nrow(dataFrame), nStars^2) # 1:(nStars^2)
library(HilbertVis)
l <- "coxRFXFitOsTDGGc"
t <- os#get(l)$surv
p <- PartialRisk(get(l), newZ=dataFrame[, colnames(get(l)$Z)])
p <- p[,colnames(p)!="Nuisance"]
locations <- hilbertCurve(log2(nStars))+1
mat <- matrix(order(locations[,1], locations[,2]), ncol=nStars)
h <- hclust(dist(p[s,]))
layout(mat[nStars:1,])
par(mar=c(0,0,0,0)+.5, bty="n")
for(i in 1:nStars^2){ # Fitted predictions
sedimentPlot(-fiveStagePredicted[seq(1,2001,200),1:5,s[h$order[i]]], x=seq(1,2001,200),y0=1, y1=0, col=c(pastel1[c(1:3,5,4)], "#DDDDDD"), xlab="time",ylab="fraction", xaxt="n", yaxt="n")
lines(x=seq(1,2001,200), y=1-rowSums(fiveStagePredicted[seq(1,2001,200),1:3,s[h$order[i]]]), lwd=2)
i <- s[h$order[i]]
lineStage(clinicalData$CR_date[i], clinicalData$Recurrence_date[i], clinicalData$Date_LF[i], clinicalData$ERDate[i], clinicalData$Status[i], col=c(brewer.pal(8,"Dark2")[8], set1[c(4:5,1:3)]), lwd=2, pch.trans=NA, y=0.05)
}
for(i in 1:nStars^2){ # Leave-one-out predictions
sedimentPlot(-multiRfx5Loo[seq(1,length(times),5),1:5,s[h$order[i]]], x=times[seq(1,length(times),5)],y0=1, y1=0, col=c(pastel1[c(1:3,5,4)], "#DDDDDD"), xlab="time",ylab="fraction", xaxt="n", yaxt="n")
lines(x=times[seq(1,length(times),5)], y=1-rowSums(multiRfx5Loo[seq(1,length(times),5),1:3,s[h$order[i]]]), lwd=2)
i <- s[h$order[i]]
lineStage(clinicalData$CR_date[i], clinicalData$Recurrence_date[i], clinicalData$Date_LF[i], clinicalData$ERDate[i], clinicalData$Status[i], col=c(brewer.pal(8,"Dark2")[8], set1[c(4:5,1:3)]), lwd=2, pch.trans=NA, y=0.05)
}
rfx5Loo <- sapply(mclapply(1:nrow(data), function(i){
e <- new.env()
t <- try(load(paste0("../code/loo/",i,".RData"), env=e))
if(class(t)=="try-error") rep(NA, length(times))
else {
cvIdx <- 1:nrow(dataFrame)
whichTrain <<- which(cvIdx != i)
pNrs <- predict(e$rfxNrs, newdata=data[cvIdx==i,])
pRel <- predict(e$rfxRel, newdata=data[cvIdx==i,])
pPrs <- predict(e$rfxPrs, newdata=data[cvIdx==i,])
pCr <- predict(e$rfxCr, newdata=data[cvIdx==i,])
pEs <- predict(e$rfxEs, newdata=data[cvIdx==i,])
pOs <- predict(e$rfxOs, newdata=dataFrame[cvIdx==i,])
c(pCr, pEs, pNrs, pRel, pPrs, pOs)
}
}, mc.cores=6), I, simplify="array")
colnames(rfx5Loo) <- rownames(data)
survConcordance(Surv(nrdData$time1, nrdData$time2, nrdData$status) ~ rfx5Loo[3,nrdData$index])
## Call:
## survConcordance(formula = Surv(nrdData$time1, nrdData$time2,
## nrdData$status) ~ rfx5Loo[3, nrdData$index])
##
## n=1609 (271 observations deleted due to missingness)
## Concordance= 0.6652892 se= 0.02393788
## concordant discordant tied.risk tied.time std(c-d)
## 96115.000 48356.000 0.000 16.000 6916.659
survConcordance(Surv(prdData$time1, prdData$time2, prdData$status) ~ rfx5Loo[5,rownames(prdData)[prdData$index]])
## Call:
## survConcordance(formula = Surv(prdData$time1, prdData$time2,
## prdData$status) ~ rfx5Loo[5, rownames(prdData)[prdData$index]])
##
## n= 832
## Concordance= 0.6768913 se= 0.01433183
## concordant discordant tied.risk tied.time std(c-d)
## 117928.000 56292.000 0.000 259.000 4993.785
survConcordance(Surv(relData$time1, relData$time2, relData$status) ~ rfx5Loo[4,relData$index])
## Call:
## survConcordance(formula = Surv(relData$time1, relData$time2,
## relData$status) ~ rfx5Loo[4, relData$index])
##
## n=1609 (271 observations deleted due to missingness)
## Concordance= 0.6437934 se= 0.01212035
## concordant discordant tied.risk tied.time std(c-d)
## 336890.00 186399.00 0.00 272.00 12684.89
survConcordance(Surv(cr[,1], cr[,2]==2) ~ rfx5Loo[1,])
## Call:
## survConcordance(formula = Surv(cr[, 1], cr[, 2] == 2) ~ rfx5Loo[1,
## ])
##
## n= 1540
## Concordance= 0.637342 se= 0.009305283
## concordant discordant tied.risk tied.time std(c-d)
## 642298.00 365478.00 0.00 11815.00 18755.28
survConcordance(Surv(cr[,1], cr[,2]==1) ~ rfx5Loo[2,])
## Call:
## survConcordance(formula = Surv(cr[, 1], cr[, 2] == 1) ~ rfx5Loo[2,
## ])
##
## n= 1540
## Concordance= 0.689559 se= 0.02411522
## concordant discordant tied.risk tied.time std(c-d)
## 114222.000 51423.000 0.000 151.000 7989.131
survConcordance(os ~ rfx5Loo[6,])
## Call:
## survConcordance(formula = os ~ rfx5Loo[6, ])
##
## n= 1540
## Concordance= 0.7238402 se= 0.01003199
## concordant discordant tied.risk tied.time std(c-d)
## 687890.00 262444.00 0.00 432.00 19067.49
Plot of absolute risk at 3 years, leave-one-out cross-validated, vs. outcome
par(mar=c(3,3,2,1), mgp=c(1.5,.5,0), bty="n")
t <- os
q <- quantile(t[,1], seq(0,1,.1))# q <- splinefun( s$surv, s$time,"monoH.FC")(seq(1,min(s$surv),l=10))
c <- cut(t[,1], q, na.rm=TRUE)
h <- colSums(multiRfx5Loo[times == 3*365,1:3,])
o <- order(h)
plot(h[o], col= (brewer.pal(10,'RdBu'))[c[o]], type='h', xaxt="n", xlab='', las=2, ylab="Survival at 3 years")
mtext(side=1, line=1, "Patient")
u <- par("usr")
q <- pmin(q,365*12)
image(x=q/max(q)*500, y=c(u[4]-(u[4]-u[3])/20, u[4]), matrix(1:10), col= (brewer.pal(10,'RdBu')), add=TRUE)
#axis(side=3, at=seq(1,500,l=11), labels=seq(0,1,.1))
axis(side=3, at=pretty(q/365)/max(q)*365*500, labels=pretty(q/365))
lines(ksmooth(seq_along(o),t[o,2]==0, bandwidth=50))
Plots of concordance and absolute prediction measures
multiRfx5C <- sapply(seq_along(times), function(i) survConcordance(os ~ colSums(multiRfx5Loo[i,1:3,]))$concordance[1])
plot(times/365.25, multiRfx5C, type='l', xlab="Time", ylab="Concordance", ylim=c(0.65, 0.73), col=set1[1])
abline(h=survConcordance(os ~ rfx5Loo[6,])$concordance, col=set1[2], lwd=1)
legend("bottomright",c("RFX OS","RFX Multistage"), col=set1[1:2], lty=1, bty="n")
a <- sapply(times, function(t) ape(1-colSums(multiRfx5Loo[times == t,1:3,]), os, t))
s <- summary(survfit(coxRFXFitOsTDGGc), times=times)
b <- sapply(times, function(t) ape(s$surv[times==t]^exp(rfx5Loo[6,]), os, t))
e <- sapply(times, function(t) ape(s$surv[times==t], os, t))
for(i in 1:4){
plot(times/365.25, e[i,], type='l', xlab="Time (yr)", ylab=rownames(a)[i], col=set1[9])
lines(times/365.25, a[i,], col=set1[1])
lines(times/365.25, b[i,], col=set1[2])
legend("bottomright",c("Kaplan-Meier","Multistage","RFX OS"), col=set1[c(9,1:2)], lty=1, bty="n")
}
Figure of predicted survival for 100 patients, comparing multistage and OS predictions
plot(s$surv^exp(rfx5Loo[6,1]), 1-rowSums(multiRfx5Loo[,1:3,1]), type='l', xlim=c(0,1), ylim=c(0,1), col='grey', xlab="Predicted survival RFX", ylab="Predicted survival Multistage")
for(i in 2:100)
lines(s$surv^exp(rfx5Loo[6,i]), 1-rowSums(multiRfx5Loo[,1:3,i]), col='grey')
Data with and without allograft
allDataTpl <- osData[rep(1:nrow(dataFrame), each=3),]
allDataTpl$transplantCR1 <- rep(c(0,1,0), nrow(dataFrame))
allDataTpl$transplantRel <- rep(c(0,0,1), nrow(dataFrame))
Here we show survival predictions for three selected patients
xmax=2000
patients <- c("PD11104a","PD8314a","PD10941a")
layout(matrix(1:(3*length(patients)), byrow=TRUE, ncol=3), width=c(0.5,1,1))
par(mar=c(3,3,1,1), bty="n", mgp=c(2,.5,0), cex=1)
for(pd in patients){
e <- new.env()
i <- which(rownames(dataFrame)==pd)
load(paste0("../code/loo/",i,".RData"), env=e)
cvIdx <- 1:nrow(dataFrame)
whichTrain <<- which(cvIdx != i)
xx <- 0:2000
coxphPrs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(prdData, time0=as.numeric(clinicalData$Recurrence_date-clinicalData$CR_date)[prdData$index])[prdData$index %in% whichTrain,])
tdPrmBaseline <- exp(predict(coxphPrs, newdata=data.frame(time0=xx[-1])))
coxphOs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(osData, time0=pmin(500,cr[osData$index,1]))[osData$index %in% whichTrain,])
tdOsBaseline <- exp(pmin(predict(coxphOs, newdata=data.frame(time0=500)),predict(coxphOs, newdata=data.frame(time0=xx[-1])))) ## cap predictions at induction length 500 days.
newdata <- allDataTpl[grep(pd, rownames(allDataTpl)),]
m <- MultiRFX5(e$rfxEs, e$rfxCr, e$rfxNrs, e$rfxRel, e$rfxPrs, newdata, tdPrmBaseline = tdPrmBaseline, tdOsBaseline = tdOsBaseline, x=2000)
w <- seq(1,2001,10)
at <- ceiling(1:5 * 365.5)
x <- (w-1)/365.25
plot.new()
p <- which(rownames(dataFrame)==pd)
mtext(side=2, paste0(c(pd, clinicalData$AOD[p], c("male","female")[clinicalData$gender[p]], gsub(";","\n",genotype[p]), paste("ELN", clinicalData$M_Risk[p])), collapse="\n"), las=1, adj=0)
for(i in c(3,2)){
sedimentPlot(-m[w,6:8,i],x=x, y0=1, y1=0, col=pastel1[c(2:3,5,4)], xlab="Years from CR",ylab="Probability", xaxs='i', yaxs='i')
o <- 1-rowSums(m[w,6:7,i])
abline(v=c(1:5), col="white", lty=3)
abline(h=seq(0.2,0.8,0.2), col="white", lty=3)
lines(x,o, lwd=2)
lines(x,o ^ exp(qnorm(0.975) * m[w,9,i]))
lines(x,o ^ exp(-qnorm(0.975) * m[w,9,i]))
text(x=rep(0,3), c(0.1,0.2,0.3), c("AAR", "RD", "NRD") )
text(x=1:5, y=rep(0.3, 5), round(m[at,6,i],2))
text(x=1:5, y=rep(0.2, 5), round(m[at,7,i],2))
text(x=1:5, y=rep(0.1, 5), round(m[at,8,i],2))
#text(x=at, y=rep(0.1, 5), round(fiveStagePredictedTpl[w,6,i],2))
lineStage(CR_date=0, as.numeric(clinicalData$Recurrence_date[p]-clinicalData$CR_date[p])/365.25, as.numeric(clinicalData$Date_LF[p]-clinicalData$CR_date[p])/365.25, ERDate=0, clinicalData$Status[p], col=c(brewer.pal(8,"Dark2")[8], set1[c(4:5,1:3)]), lwd=4, pch.trans=NA, y=0.05, cex=4)
points(x=as.numeric(clinicalData$TPL_date[p]-clinicalData$CR_date[p])/365.25, y=0.05, pch=4, cex=1.5, lwd=1)
}}
Here we show results for four additional patients; the first three are chosen as counterparts to the patients shown in Figure 3, and the fourth is a patient with ASXL1 mutations.
xmax=2000
patients <- c("PD10828a","PD10844a","PD10829a","PD10996a")
layout(matrix(1:(3*length(patients)), byrow=TRUE, ncol=3), width=c(0.5,1,1))
par(mar=c(3,3,1,1), bty="n", mgp=c(2,.5,0), cex=1)
for(pd in patients){
e <- new.env()
i <- which(rownames(dataFrame)==pd)
load(paste0("../code/loo/",i,".RData"), env=e)
cvIdx <- 1:nrow(dataFrame)
whichTrain <<- which(cvIdx != i)
xx <- 0:2000
coxphPrs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(prdData, time0=as.numeric(clinicalData$Recurrence_date-clinicalData$CR_date)[prdData$index])[prdData$index %in% whichTrain,])
tdPrmBaseline <- exp(predict(coxphPrs, newdata=data.frame(time0=xx[-1])))
coxphOs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(osData, time0=pmin(500,cr[osData$index,1]))[osData$index %in% whichTrain,])
tdOsBaseline <- exp(pmin(predict(coxphOs, newdata=data.frame(time0=500)),predict(coxphOs, newdata=data.frame(time0=xx[-1])))) ## cap predictions at induction length 500 days.
newdata <- allDataTpl[grep(pd, rownames(allDataTpl)),]
m <- MultiRFX5(e$rfxEs, e$rfxCr, e$rfxNrs, e$rfxRel, e$rfxPrs, newdata, tdPrmBaseline = tdPrmBaseline, tdOsBaseline = tdOsBaseline, x=2000)
w <- seq(1,2001,10)
at <- ceiling(1:5 * 365.5)
x <- (w-1)/365.25
plot.new()
p <- which(rownames(dataFrame)==pd)
mtext(side=2, paste0(c(pd, clinicalData$AOD[p], c("male","female")[clinicalData$gender[p]], gsub(";","\n",genotype[p]), paste("ELN", clinicalData$M_Risk[p])), collapse="\n"), las=1, adj=0)
for(i in c(3,2)){
sedimentPlot(-m[w,6:8,i],x=x, y0=1, y1=0, col=pastel1[c(2:3,5,4)], xlab="Years from CR",ylab="Probability", xaxs='i', yaxs='i')
o <- 1-rowSums(m[w,6:7,i])
abline(v=c(1:5), col="white", lty=3)
abline(h=seq(0.2,0.8,0.2), col="white", lty=3)
lines(x,o, lwd=2)
lines(x,o ^ exp(qnorm(0.975) * m[w,9,i]))
lines(x,o ^ exp(-qnorm(0.975) * m[w,9,i]))
text(x=rep(0,3), c(0.1,0.2,0.3), c("AAR", "RD", "NRD") )
text(x=1:5, y=rep(0.3, 5), round(m[at,6,i],2))
text(x=1:5, y=rep(0.2, 5), round(m[at,7,i],2))
text(x=1:5, y=rep(0.1, 5), round(m[at,8,i],2))
#text(x=at, y=rep(0.1, 5), round(fiveStagePredictedTpl[w,6,i],2))
lineStage(CR_date=0, as.numeric(clinicalData$Recurrence_date[p]-clinicalData$CR_date[p])/365.25, as.numeric(clinicalData$Date_LF[p]-clinicalData$CR_date[p])/365.25, ERDate=0, clinicalData$Status[p], col=c(brewer.pal(8,"Dark2")[8], set1[c(4:5,1:3)]), lwd=4, pch.trans=NA, y=0.05, cex=4)
points(x=as.numeric(clinicalData$TPL_date[p]-clinicalData$CR_date[p])/365.25, y=0.05, pch=4, cex=1.5, lwd=1)
}}
We use the following function to compute the hierarchical adjustment for two subsequent stages.
cppFunction('NumericVector computeTotalPrsC(NumericVector x, NumericVector diffCir, NumericVector prsP, NumericVector tdPrmBaseline, double risk) {
int xLen = x.size();
double hj;
double r = exp(risk);
NumericVector rs(xLen);
for(int i = 0; i < xLen; ++i) rs[i] = 1;
for(int j = 1; j < xLen; ++j){
hj = tdPrmBaseline[j-1] * r;
for(int i = j; i < xLen; ++i){
rs[i] += diffCir[j-1] * (1-pow(prsP[i-j], hj));
}
}
return rs;
}', rebuild=TRUE)
Function to predict OS from Relapse, PRS and NRM, as described in Section 4.3.5. It is slightly more efficient than MultiRFX5, as it does not require evaluating the probability of reaching CR.
MultiRFX3 <- function(coxRFXNrdTD, coxRFXRelTD, coxRFXPrdTD, data, x =365, ciType="analytical", prdData){
## Step 1: Compute KM survival curves and log hazard
getS <- function(coxRFX, data, max.x=5000) {
if(!is.null(coxRFX$na.action)) coxRFX$Z <- coxRFX$Z[-coxRFX$na.action,]
data <- as.matrix(data[,match(colnames(coxRFX$Z),colnames(data))])
r <- PredictRiskMissing(coxRFX, data, var="var2")
H0 <- basehaz(coxRFX, centered = FALSE)
hazardDist <- splinefun(H0$time, H0$hazard, method="monoH.FC")
x <- c(0:ceiling(max.x))
S <- exp(-hazardDist(x))
return(list(S=S, r=r, x=x, hazardDist=hazardDist, r0 = coxRFX$means %*% coef(coxRFX)))
}
kmRel <- getS(coxRFX = coxRFXRelTD, data = data, max.x=max(x))
kmNrd <- getS(coxRFX = coxRFXNrdTD, data = data, max.x=max(x))
kmPrd <- getS(coxRFX = coxRFXPrdTD, data = data, max.x=max(x))
## Step 2: Adjust CIR and NRM curve for competing risks, accounting for hazard
kmRel$Sadj <- sapply(1:nrow(data), function(i) cumsum(c(1,diff(kmRel$S^exp(kmRel$r[i,1]))) * kmNrd$S ^ exp(kmNrd$r[i,1])))
kmNrd$Sadj <- sapply(1:nrow(data), function(i) cumsum(c(1,diff(kmNrd$S^exp(kmNrd$r[i,1]))) * kmRel$S ^ exp(kmRel$r[i,1]))) ## array times x nrow(data)
stopifnot(length(x)==1 | length(x) == nrow(data))
if(length(x)==nrow(data))
w <- match(x,kmRel$x)
else if(length(x)==1)
w <- rep(match(x, kmRel$x), nrow(data))
y <- mapply(function(i,j) kmNrd$Sadj[i,j], w,1:length(w) ) # select time for each sample
nrs <- y
nrsUp <- y^exp(2*sqrt(kmNrd$r[,2]))
nrsLo <- y^exp(- 2*sqrt(kmNrd$r[,2]))
y <- mapply(function(i,j) kmRel$Sadj[i,j], w,1:length(w) ) # select time for each sample
cir <- y
cirLo <- y^exp( 2*sqrt(kmRel$r[,2]))
cirUp <- y^exp( - 2*sqrt(kmRel$r[,2]))
## Step 3: Compute post-relapse survival
survPredict <- function(surv){
s <- survfit(surv~1)
splinefun(s$time, s$surv, method="monoH.FC")
}
xx <- 0:max(x)
# Baseline Prs (measured from relapse)
kmPrs0 <- survPredict(Surv(prdData$time1, prdData$time2, prdData$status))(xx)
# PRS baseline with spline-based dependence on CR length
coxphPrs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=prdData )
tdPrmBaseline <- exp(predict(coxphPrs, newdata=data.frame(time0=xx[-1])))
rs <- sapply(1:nrow(data), function(i){
### Different approach
xLen <- 1+floor(x)
cir <- kmRel$Sadj[1:xLen,i]
rs <- computeTotalPrsC(x = xx, diffCir = diff(cir), prsP = kmPrs0, tdPrmBaseline = tdPrmBaseline, risk = kmPrd$r[i,1]-kmPrd$r0)
rs[xLen]
})
## Step 4: Combine into overall survival
if(any(1-(1-rs)-(1-nrs)<0)) warning("OS < 0 occurred.")
os <- pmax(pmin(1-(1-rs)-(1-nrs),1),0)
## Step 5: Confidence intervals for OS
osCi <- sapply(1:nrow(data), function(i){
if("analytical" == ciType){
## Confidence intervals
PlogP2 <- function(x) {(x * log(x))^2}
errOs <- kmNrd$r[i,2] * PlogP2(kmNrd$S[w[i]]) * (1-kmRel$S[w[i]] * kmPrd$S[w[i]])^2 + kmRel$r[i,2] * (1-kmNrd$S[w[i]])^2* kmPrd$S[w[i]]^2 * PlogP2(kmRel$S[w[i]]) + kmPrd$r[i,2] * (1-kmNrd$S[w[i]])^2* kmRel$S[w[i]]^2 * PlogP2(kmPrd$S[w[i]])
errOs <- errOs / PlogP2(1-(1-kmNrd$S[w[i]])*(1-kmRel$S[w[i]]*kmPrd$S[w[i]]))
return(c(osUp=os[i] ^ exp(-2* errOs), osLo= os[i] ^ exp(+2*errOs)))
} else if("simulated" == ciType){
## Simulate CI
nSim <- 200
osCiMc <- sapply(1:nSim, function(foo){
H <- exp(rnorm(3,c(kmRel$r[i,1],kmNrd$r[i,1],kmPrd$r[i,1]),sqrt(c(kmRel$r[i,2],kmNrd$r[i,2],kmPrd$r[i,2]))))
nrs <- cumsum(c(1,diff(kmNrd$S^H[2]) * kmRel$S[-1]^H[1])) ## Correct KM estimate for competing risk
diffCir <- diff(kmRel$inc^H[1]) * kmNrd$inc[-1]^H[2] ## Correct KM estimate for competing risk
rs <- computeTotalPrsC(x = x, diffCir = diffCir, prsP = kmPrs0, tdPrmBaseline = tdPrmBaseline, risk = -kmPrd$r0+log(H[3]))
return((1-(1-nrs)-(1-rs))[w[i]])
})
osCiMcQ <- quantile(osCiMc, c(0.025,0.975))
return(c(osUp = osCiMcQ[2], osLo = osCiMcQ[1]))
}
})
return(data.frame(os=os, osLo = osCi[2,], osUp = osCi[1,], cir=cir, cirLo=cirLo, cirUp=cirUp, nrs=nrs, nrsLo=nrsLo, nrsUp=nrsUp, rs=rs ))
}
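Step 2 of both MultiRFX3 and MultiRFX5 turns cause-specific survival curves into curves adjusted for the competing event: the daily increments of one curve are weighted by the probability of not yet having experienced the other. On toy curves (not study data):

```r
## Toy cause-specific survival curves (illustration only)
S1 <- c(1, 0.90, 0.80, 0.70)   # e.g. relapse
S2 <- c(1, 0.95, 0.90, 0.85)   # e.g. non-relapse mortality
## Adjusted curve: accumulate cause-1 increments weighted by cause-2 survival
S1adj <- cumsum(c(1, diff(S1)) * S2)
## The competing event removes patients before they can relapse, so the
## adjusted cumulative incidence 1 - S1adj is below the naive 1 - S1:
all(1 - S1adj <= 1 - S1)  # TRUE
```

This is the standard cumulative-incidence construction; the exponents exp(r) applied in the code above simply scale each curve by the patient-specific relative hazard first.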
Create a data.frame with all data in CR
allData <- MakeTimeDependent(dataFrame[whichRFXRel], timeEvent=alloTimeCR1, timeStop=as.numeric(clinicalData$Date_LF- clinicalData$CR_date), status=clinicalData$Status)
allData$transplantCR1 <- allData$event
allData$transplantRel <- allData$event
w <- which(clinicalData$TPL_date > clinicalData$Recurrence_date)
allData$transplantCR1[allData$index %in% w] <- 0
allData$transplantRel[!allData$index %in% w] <- 0
multiRFX3 <- MultiRFX3(coxRFXNrdTD = coxRFXNrdTD, coxRFXPrdTD = coxRFXPrdTD, coxRFXRelTD = coxRFXRelTD, data=allData, x=3*365, prdData=prdData)
replicates <- 100 ## number of replicates
concordanceCIRcv <- lapply(list(crGroups[crGroups %in% mainGroups], crGroups), function(g){
mclapply(1:replicates, function(foo){
set.seed(foo)
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## sample 4/5
dNrm <- nrdData[nrdData$index %in% which(trainIdx),names(g)]
sNrm <- Surv(nrdData$time1, nrdData$time2, nrdData$status)[nrdData$index %in% which(trainIdx)]
coxRFXNrdTD <- CoxRFX(dNrm, sNrm, groups=g, nu=1, which.mu = mainGroups)
coxRFXNrdTD$coefficients["transplantRel"] <- 0
dPrs <- prdData[prdData$index %in% which(trainIdx), c(names(g),"time0","time1","time2","status")]
sPrs <- Surv(prdData$time1, prdData$time2, prdData$status)[prdData$index %in% which(trainIdx)]
coxRFXPrdTD <- CoxRFX(dPrs, sPrs, groups=g, nu=1, which.mu = mainGroups)
dCir <- relData[relData$index %in% which(trainIdx), names(g)]
sCir <- Surv(relData$time1, relData$time2, relData$status)[relData$index %in% which(trainIdx)]
coxRFXRelTD <- CoxRFX(dCir, sCir, groups=g, which.mu = mainGroups)
coxRFXRelTD$coefficients["transplantRel"] <- 0
dOs <- osData[osData$index %in% which(trainIdx), names(g)]
sOs <- Surv(osData$time1, osData$time2, osData$status)[osData$index %in% which(trainIdx)]
coxRFXOsCR <- CoxRFX(dOs, sOs, groups=g, which.mu = mainGroups)
allRisk365 <- MultiRFX3(coxRFXNrdTD = coxRFXNrdTD, coxRFXPrdTD = coxRFXPrdTD, coxRFXRelTD = coxRFXRelTD, data=allData, x=365, prdData=dPrs)
allRisk1000 <- MultiRFX3(coxRFXNrdTD = coxRFXNrdTD, coxRFXPrdTD = coxRFXPrdTD, coxRFXRelTD = coxRFXRelTD, data=allData, x=1000, prdData=dPrs)
p365 <- -allRisk365[,1]
p1000 <- -allRisk1000[,1]
pCIR <- as.matrix(relData[names(g)]) %*% coef(coxRFXRelTD)
pPRS <- as.matrix(prdData[names(g)]) %*% coef(coxRFXPrdTD)
pNRM <- as.matrix(nrdData[names(g)]) %*% coef(coxRFXNrdTD)
pOS <- as.matrix(osData[names(g)]) %*% coef(coxRFXOsCR)
C <- c(
CIRrfx = survConcordance(Surv(time1, time2, status)~ pCIR, data=relData, subset = relData$index %in% which(!trainIdx) )$concordance,
PRSrfx = survConcordance(Surv(time1, time2, status) ~ pPRS, data=prdData, subset=prdData$index %in% which(!trainIdx) )$concordance,
NRMrfx = survConcordance(Surv(time1, time2, status)~ pNRM, data=nrdData, subset=nrdData$index %in% which(!trainIdx) )$concordance,
OSrfx = survConcordance(Surv(time1, time2, status) ~ pOS, data=osData, subset=osData$index %in% which(!trainIdx) )$concordance,
OS365 = survConcordance(Surv(time1, time2, status) ~ p365, data=osData, subset=osData$index %in% which(!trainIdx) )$concordance,
OS1000 = survConcordance(Surv(time1,time2, status) ~ p1000, data=osData, subset=osData$index %in% which(!trainIdx) )$concordance
)
coef <- cbind(CIRrfx=coef(coxRFXRelTD), PRSrfx=coef(coxRFXPrdTD), NRMrfx=coef(coxRFXNrdTD), OSrfx=coef(coxRFXOsCR))
return(list(C=C, coef=coef, allRisk365=allRisk365, allRisk1000=allRisk1000))
}, mc.cores=10)
})
apply(apply(-sapply(concordanceCIRcv[[1]], `[[` , "C")[4:6,],2,rank),1,function(x) table(factor(x, levels=1:3)))
## OSrfx.concordant OS365.concordant OS1000.concordant
## 1 70 9 20
## 2 9 25 65
## 3 21 65 14
apply(apply(-sapply(concordanceCIRcv[[2]], `[[` , "C")[4:6,],2,rank),1,function(x) table(factor(x, levels=1:3)))
## OSrfx.concordant OS365.concordant OS1000.concordant
## 1 39 7 53
## 2 23 31 45
## 3 37 61 2
Test and train errors
i <- 0
concordanceCIRcvTrain <- lapply(list(crGroups[crGroups %in% mainGroups], crGroups), function(g){
i <- i+1
sapply(1:replicates, function(foo){
set.seed(foo)
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## sample 4/5
coef <- concordanceCIRcv[[i]][[foo]][["coef"]]
pCIR <- as.matrix(relData[names(coef[,"CIRrfx"])]) %*% coef[,"CIRrfx"]
pPRS <- as.matrix(prdData[names(coef[,"PRSrfx"])]) %*% coef[,"PRSrfx"]
pNRM <- as.matrix(nrdData[names(coef[,"NRMrfx"])]) %*% coef[,"NRMrfx"]
pOS <- as.matrix(osData[names(coef[,"OSrfx"])]) %*% coef[,"OSrfx"]
p365 <- -concordanceCIRcv[[i]][[foo]][["allRisk365"]]$os
p1000 <- -concordanceCIRcv[[i]][[foo]][["allRisk1000"]]$os
C <- sapply(list(train=which(trainIdx), test=which(!trainIdx)), function(w)
c(
CIRrfx = survConcordance(Surv(time1, time2, status)~ pCIR, data=relData, subset = relData$index %in% w )$concordance,
PRSrfx = survConcordance(Surv(time1, time2, status) ~ pPRS, data=prdData, subset=prdData$index %in% w )$concordance,
NRMrfx = survConcordance(Surv(time1, time2, status)~ pNRM, data=nrdData, subset=nrdData$index %in% w )$concordance,
OSrfx = survConcordance(Surv(time1, time2, status) ~ pOS, data=osData, subset=osData$index %in% w )$concordance,
OS365 = survConcordance(Surv(time1, time2, status) ~ p365, data=osData, subset=osData$index %in% w )$concordance,
OS1000 = survConcordance(Surv(time1,time2, status) ~ p1000, data=osData, subset=osData$index %in% w )$concordance
))
return(C)
}, simplify='array')
})
Plot test and training errors
for(i in 1:4){
plot(t(concordanceCIRcvTrain[[2]][i,,] ), main=rownames(concordanceCIRcvTrain[[2]])[i])
abline(0,1)
}
Plot coefficients vs. mean of subsampled coefficients
r <- rowMeans(sapply(concordanceCIRcv[[2]], `[[` , "coef", simplify="array"), dim=2)
plot(r[,1], coef(coxRFXRelTD)); abline(0,1)
plot(r[,2], coef(coxRFXPrdTD)); abline(0,1)
plot(r[,3], coef(coxRFXNrdTD)); abline(0,1)
plot(r[,4], coef(coxRFXOsCR)); abline(0,1)
Variance-based concordance estimate
i <- 0
concordanceCIRcvVar <- lapply(list(crGroups[crGroups %in% mainGroups], crGroups), function(g){
i <- i+1
sapply(1:replicates, function(foo){
set.seed(foo)
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## sample 4/5
coef <- concordanceCIRcv[[i]][[foo]][["coef"]]
pCIR <- as.matrix(relData[names(coef[,"CIRrfx"])]) %*% coef[,"CIRrfx"]
pPRS <- as.matrix(prdData[names(coef[,"PRSrfx"])]) %*% coef[,"PRSrfx"]
pNRM <- as.matrix(nrdData[names(coef[,"NRMrfx"])]) %*% coef[,"NRMrfx"]
pOS <- as.matrix(osData[names(coef[,"OSrfx"])]) %*% coef[,"OSrfx"]
C <- sapply(list(train=which(trainIdx), test=which(!trainIdx)), function(w){
sapply(ls(sys.frame(-3),pattern='^p[A-Z]+'), function(x)
CoxHD:::ConcordanceFromVariance(var(get(x)[w], na.rm=TRUE)))[c(1,4,2,3)]
})
}, simplify="array")})
for(i in 1:4)
{cat(rownames(concordanceCIRcvTrain[[2]])[i],"\n"); print(summary(data.frame(harrel=t(concordanceCIRcvTrain[[2]][i,1:2,]) , var=t(concordanceCIRcvVar[[2]][i,1:2,]))))}
## CIRrfx.concordant
## harrel.train harrel.test var.train var.test
## Min. :0.7055 Min. :0.6289 Min. :0.6609 Min. :0.6546
## 1st Qu.:0.7176 1st Qu.:0.6681 1st Qu.:0.6711 1st Qu.:0.6691
## Median :0.7207 Median :0.6802 Median :0.6760 Median :0.6759
## Mean :0.7208 Mean :0.6807 Mean :0.6759 Mean :0.6755
## 3rd Qu.:0.7248 3rd Qu.:0.6923 3rd Qu.:0.6800 3rd Qu.:0.6813
## Max. :0.7353 Max. :0.7428 Max. :0.6972 Max. :0.6937
## PRSrfx.concordant
## harrel.train harrel.test var.train var.test
## Min. :0.6744 Min. :0.5938 Min. :0.6491 Min. :0.6344
## 1st Qu.:0.6828 1st Qu.:0.6335 1st Qu.:0.6577 1st Qu.:0.6525
## Median :0.6870 Median :0.6528 Median :0.6603 Median :0.6587
## Mean :0.6873 Mean :0.6516 Mean :0.6612 Mean :0.6596
## 3rd Qu.:0.6913 3rd Qu.:0.6679 3rd Qu.:0.6645 3rd Qu.:0.6656
## Max. :0.7023 Max. :0.7141 Max. :0.6772 Max. :0.6878
## NRMrfx.concordant
## harrel.train harrel.test var.train var.test
## Min. :0.7312 Min. :0.5693 Min. :0.6617 Min. :0.6601
## 1st Qu.:0.7573 1st Qu.:0.6424 1st Qu.:0.6840 1st Qu.:0.6835
## Median :0.7657 Median :0.6714 Median :0.6932 Median :0.6902
## Mean :0.7652 Mean :0.6691 Mean :0.6920 Mean :0.6912
## 3rd Qu.:0.7729 3rd Qu.:0.6962 3rd Qu.:0.6985 3rd Qu.:0.7000
## Max. :0.7975 Max. :0.7836 Max. :0.7194 Max. :0.7231
## OSrfx.concordant
## harrel.train harrel.test var.train var.test
## Min. :0.7123 Min. :0.6298 Min. :0.6804 Min. :0.6805
## 1st Qu.:0.7250 1st Qu.:0.6714 1st Qu.:0.6967 1st Qu.:0.6941
## Median :0.7283 Median :0.6876 Median :0.7000 Median :0.7000
## Mean :0.7284 Mean :0.6892 Mean :0.7000 Mean :0.6999
## 3rd Qu.:0.7320 3rd Qu.:0.7055 3rd Qu.:0.7038 3rd Qu.:0.7058
## Max. :0.7422 Max. :0.7577 Max. :0.7175 Max. :0.7232
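The variance-based estimate exploits the fact that, for a Gaussian linear predictor under proportional hazards, concordance is (approximately) a function of the predictor's variance alone. As a Monte-Carlo sketch on simulated, uncensored data (the exact formula inside CoxHD:::ConcordanceFromVariance is not reproduced here):

```r
## Simulate uncensored proportional-hazards data with a Gaussian linear
## predictor and compute Harrell's C over all pairs; C depends only on the
## variance of the linear predictor (unit variance gives C of roughly 0.72).
set.seed(1)
n <- 500
lp <- rnorm(n, sd=1)              # linear predictor with unit variance
tm <- rexp(n, rate=exp(lp))       # exponential survival times
ij <- combn(n, 2)                 # all pairs of subjects
C <- mean((lp[ij[1,]] - lp[ij[2,]]) * (tm[ij[2,]] - tm[ij[1,]]) > 0)
C
```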
concordanceCIRcvTrial <- mclapply(list(crGroups[crGroups %in% mainGroups], crGroups), function(g){
mclapply(levels(clinicalData$Study), function(study){
trainIdx <- clinicalData$Study != study
g <- g[colSums(allData[allData$index %in% which(trainIdx), names(g)])>0]
if(study == "AMLSG0704") g <- g[names(g) != "AMLHD98B"] # avoid collinearity
dNrm <- nrdData[nrdData$index %in% which(trainIdx),names(g)]
sNrm <- Surv(nrdData$time1, nrdData$time2, nrdData$status)[nrdData$index %in% which(trainIdx)]
coxRFXNrdTD <- CoxRFX(dNrm, sNrm, groups=g, nu=1, which.mu = mainGroups)
coxRFXNrdTD$coefficients["transplantRel"] <- 0
dPrs <- prdData[prdData$index %in% which(trainIdx), c(names(g),"time0","time1","time2","status")]
sPrs <- Surv(prdData$time1, prdData$time2, prdData$status)[prdData$index %in% which(trainIdx)]
coxRFXPrdTD <- CoxRFX(dPrs, sPrs, groups=g, nu=1, which.mu = mainGroups)
dCir <- relData[relData$index %in% which(trainIdx), names(g)]
sCir <- Surv(relData$time1, relData$time2, relData$status)[relData$index %in% which(trainIdx)]
coxRFXRelTD <- CoxRFX(dCir, sCir, groups=g, which.mu = mainGroups)
coxRFXRelTD$coefficients["transplantRel"] <- 0
dOs <- osData[osData$index %in% which(trainIdx), names(g)]
sOs <- Surv(osData$time1, osData$time2, osData$status)[osData$index %in% which(trainIdx)]
coxRFXOsCR <- CoxRFX(dOs, sOs, groups=g, which.mu = mainGroups)
allRisk365 <- MultiRFX3(coxRFXNrdTD = coxRFXNrdTD, coxRFXPrdTD = coxRFXPrdTD, coxRFXRelTD = coxRFXRelTD, data=allData, x=365, prdData=dPrs)
allRisk1000 <- MultiRFX3(coxRFXNrdTD = coxRFXNrdTD, coxRFXPrdTD = coxRFXPrdTD, coxRFXRelTD = coxRFXRelTD, data=allData, x=1000, prdData=dPrs)
p365 <- -allRisk365[,1]
p1000 <- -allRisk1000[,1]
pCIR <- as.matrix(relData[names(g)]) %*% coef(coxRFXRelTD)
pPRS <- as.matrix(prdData[names(g)]) %*% coef(coxRFXPrdTD)
pNRM <- as.matrix(nrdData[names(g)]) %*% coef(coxRFXNrdTD)
pOS <- as.matrix(osData[names(g)]) %*% coef(coxRFXOsCR)
C <- c(
CIRrfx = survConcordance(Surv(time1, time2, status)~ pCIR, data=relData, subset = relData$index %in% which(!trainIdx) )$concordance,
PRSrfx = survConcordance(Surv(time2 - time1, status) ~ pPRS, data=prdData, subset=prdData$index %in% which(!trainIdx) )$concordance,
NRMrfx = survConcordance(Surv(time1, time2, status)~ pNRM, data=nrdData, subset=nrdData$index %in% which(!trainIdx) )$concordance,
OSrfx = survConcordance(Surv(time1, time2, status) ~ pOS, data=osData, subset=osData$index %in% which(!trainIdx) )$concordance,
OS365 = survConcordance(Surv(time1, time2, status) ~ p365, data=osData, subset=osData$index %in% which(!trainIdx) )$concordance,
OS1000 = survConcordance(Surv(time1,time2, status) ~ p1000, data=osData, subset=osData$index %in% which(!trainIdx) )$concordance
)
coef <- cbind(RELrfx=coef(coxRFXRelTD), PRSrfx=coef(coxRFXPrdTD), NRSrfx=coef(coxRFXNrdTD), OSrfx=coef(coxRFXOsCR))
return(list(C=C, coef=coef, allRisk365=allRisk365, allRisk1000=allRisk1000))
}, mc.cores=3)
}, mc.cores=2)
dotplot(sapply(concordanceCIRcvTrial[[1]], `[[` , "C")[4:6,])
dotplot(sapply(concordanceCIRcvTrial[[2]], `[[` , "C")[4:6,])
# CIR
c <- coxph(Surv(time1,time2,status)~transplantCR1*AOD_10, data=relData)
print(c)
## Call:
## coxph(formula = Surv(time1, time2, status) ~ transplantCR1 *
## AOD_10, data = relData)
##
##
## coef exp(coef) se(coef) z p
## transplantCR1 0.514 1.671 0.4349 1.18 2.4e-01
## AOD_10 0.193 1.213 0.0385 5.01 5.4e-07
## transplantCR1:AOD_10 -0.255 0.775 0.0948 -2.69 7.2e-03
##
## Likelihood ratio test=71.2 on 3 df, p=2.33e-15 n= 1609, number of events= 615
## (271 observations deleted due to missingness)
anova(c)
## Analysis of Deviance Table
## Cox model: response is Surv(time1, time2, status)
## Terms added sequentially (first to last)
##
## loglik Chisq Df Pr(>|Chi|)
## NULL -4115.7
## transplantCR1 -4093.3 44.9037 1 2.070e-11 ***
## AOD_10 -4083.6 19.3132 1 1.109e-05 ***
## transplantCR1:AOD_10 -4080.1 7.0165 1 0.008076 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#NRM
c <- coxph(Surv(time1,time2,status)~transplantCR1*AOD_10, data=nrdData)
print(c)
## Call:
## coxph(formula = Surv(time1, time2, status) ~ transplantCR1 *
## AOD_10, data = nrdData)
##
##
## coef exp(coef) se(coef) z p
## transplantCR1 0.4899 1.63 0.7724 0.634 0.5300
## AOD_10 0.2847 1.33 0.0885 3.217 0.0013
## transplantCR1:AOD_10 0.0676 1.07 0.1545 0.438 0.6600
##
## Likelihood ratio test=36.7 on 3 df, p=5.36e-08 n= 1609, number of events= 165
## (271 observations deleted due to missingness)
anova(c)
## Analysis of Deviance Table
## Cox model: response is Surv(time1, time2, status)
## Terms added sequentially (first to last)
##
## loglik Chisq Df Pr(>|Chi|)
## NULL -1097.4
## transplantCR1 -1088.7 17.4353 1 2.973e-05 ***
## AOD_10 -1079.1 19.0581 1 1.268e-05 ***
## transplantCR1:AOD_10 -1079.0 0.1927 1 0.6607
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#PRS
c <- coxph(Surv(time1,time2,status)~ transplantRel*AOD_10, data=prdData)
print(c)
## Call:
## coxph(formula = Surv(time1, time2, status) ~ transplantRel *
## AOD_10, data = prdData)
##
##
## coef exp(coef) se(coef) z p
## transplantRel 0.187 1.205 0.4500 0.415 6.8e-01
## AOD_10 0.205 1.228 0.0426 4.814 1.5e-06
## transplantRel:AOD_10 -0.147 0.863 0.0929 -1.585 1.1e-01
##
## Likelihood ratio test=49.1 on 3 df, p=1.23e-10 n= 832, number of events= 479
anova(c)
## Analysis of Deviance Table
## Cox model: response is Surv(time1, time2, status)
## Terms added sequentially (first to last)
##
## loglik Chisq Df Pr(>|Chi|)
## NULL -2772.3
## transplantRel -2760.0 24.6720 1 6.797e-07 ***
## AOD_10 -2749.0 21.9728 1 2.765e-06 ***
## transplantRel:AOD_10 -2747.8 2.4687 1 0.1161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This function computes the average accuracy of multiple absolute survival predictions at a given point in time. It subdivides the predictions into equally sized bins and computes the weighted average absolute difference between the Kaplan-Meier estimate of the survival probability and the predicted probability.
EvalAbsolutePred <- function(prediction, surv, time, bins=seq(0,1,0.05)){
c <- cut(prediction, bins)
f <- survfit(surv ~ c)
e <- summary(f, time)
x <- sapply(strsplit(gsub("[a-z\\=\\(]|]","",e$strata),","), function(x) mean(as.numeric(x)));
#w <- 1/(e$std.err+.Machine$double.eps)^2
w <- e$n[e$strata]
std.err <- 1/sum(w, na.rm=TRUE)
mean.error <- sum(abs(e$surv-x)*w, na.rm=TRUE)*std.err
return(list(mean.error=mean.error, std.err=std.err, survfit=e, x=x))
}
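As a minimal, self-contained sketch of the binning-and-weighting logic (all data simulated; observed event proportions stand in for the Kaplan-Meier estimate used above):

```r
## Simulated illustration of EvalAbsolutePred's calibration measure:
## bin predicted survival probabilities, compare each bin's mean prediction
## with the observed survival fraction, and weight the absolute differences
## by bin size.
set.seed(1)
prediction <- runif(200)                    # hypothetical predicted probabilities
observed <- rbinom(200, 1, prediction)      # hypothetical binary outcomes
bins <- cut(prediction, seq(0, 1, 0.2))
midpoint <- tapply(prediction, bins, mean)  # cf. x in EvalAbsolutePred
observedP <- tapply(observed, bins, mean)   # observed fraction per bin
w <- as.numeric(table(bins))                # weight by bin size, cf. e$n
meanError <- sum(abs(observedP - midpoint) * w, na.rm=TRUE) / sum(w)
meanError  # small when predictions are well calibrated
```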
Absolute prediction error
absPredError <- EvalAbsolutePred(multiRFX3$os, Surv(allData$time1, allData$time2, allData$status), time=3*365)
plot(absPredError$x, absPredError$survfit$surv, xlim=c(0,1), ylim=c(0,1), xlab="Predicted probability", ylab="Observed", main="Prediction tool")
segments(absPredError$x, absPredError$survfit$lower,absPredError$x, absPredError$survfit$upper)
abline(0,1)
PredictAbsoluteCoxph <- function(coxRFXOsCR, allData, time) {
s <- survfit(coxRFXOsCR)
s$surv[which.min(abs(s$time-time))] ^ exp(predict(coxRFXOsCR, newdata=allData))
}
q <- PredictAbsoluteCoxph(coxRFXOsCR = coxRFXOsCR, allData = allData, time=365)
absPredErrorOs <- EvalAbsolutePred(q, Surv(allData$time1, allData$time2, allData$status), time=365)
plot(absPredErrorOs$x, absPredErrorOs$survfit$surv, xlim=c(0,1), ylim=c(0,1), xlab="Predicted probability", ylab="Observed", main="RFX on OS")
segments(absPredErrorOs$x, absPredErrorOs$survfit$lower,absPredErrorOs$x, absPredErrorOs$survfit$upper)
abline(0,1)
Evaluate the cross-validated samples
i <- 0
absoluteErrorsCIRcv <- lapply(list(crGroups[crGroups %in% mainGroups], crGroups), function(g){
i <- i+1
sapply(1:replicates, function(foo){
set.seed(foo)
time <- 365
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## sample 4/5
coef <- concordanceCIRcv[[i]][[foo]][["coef"]]
lpCIR <- as.matrix(relData[names(coef[,"CIRrfx"])]) %*% coef[,"CIRrfx"]
s <- survfit(Surv(time1, time2, status)~1, data=relData, subset=relData$index %in% which(trainIdx))
pCIR <- s$surv[which.min(abs(s$time-time))] ^ exp(lpCIR-mean(lpCIR[relData$index %in%which(trainIdx)]))
lpPRS <- as.matrix(prdData[names(coef[,"PRSrfx"])]) %*% coef[,"PRSrfx"]
s <- survfit(Surv(time2- time1, status)~1, data=prdData, subset=prdData$index %in% which(trainIdx))
pPRS <- s$surv[which.min(abs(s$time-time))] ^ exp(lpPRS-mean(lpPRS[prdData$index %in% which(trainIdx)]))
lpNRM <- as.matrix(nrdData[names(coef[,"NRMrfx"])]) %*% coef[,"NRMrfx"]
s <- survfit(Surv(time1, time2, status)~1, data=nrdData, subset=nrdData$index %in% which(trainIdx))
pNRM <- s$surv[which.min(abs(s$time-time))] ^ exp(lpNRM-mean(lpNRM[nrdData$index %in% which(trainIdx)]))
lpOS <- as.matrix(osData[names(coef[,"OSrfx"])]) %*% coef[,"OSrfx"]
s <- survfit(Surv(time1, time2, status)~1, data=osData, subset=osData$index %in% which(trainIdx))
pOS <- s$surv[which.min(abs(s$time-time))] ^ exp(lpOS-mean(lpOS[osData$index %in% which(trainIdx)]))
p365 <- concordanceCIRcv[[i]][[foo]][["allRisk365"]]$os
p1000 <- concordanceCIRcv[[i]][[foo]][["allRisk1000"]]$os
err <- sapply(list(train=which(trainIdx), test=which(!trainIdx)), function(w)
c(
CIRrfx = EvalAbsolutePred(pCIR[relData$index %in% w ], Surv(relData$time1, relData$time2, relData$status)[relData$index %in% w ], time=365)$mean.error,
PRSrfx = EvalAbsolutePred(pPRS[prdData$index %in% w ], Surv(prdData$time1, prdData$time2, prdData$status)[prdData$index %in% w ], time=365)$mean.error,
NRMrfx = EvalAbsolutePred(pNRM[nrdData$index %in% w ], Surv(nrdData$time1, nrdData$time2, nrdData$status)[nrdData$index %in% w ], time=365)$mean.error,
OSrfx = EvalAbsolutePred(pOS[osData$index %in% w ], Surv(osData$time1, osData$time2, osData$status)[osData$index %in% w ], time=365)$mean.error,
OS365 = EvalAbsolutePred(p365[osData$index %in% w ], Surv(osData$time1, osData$time2, osData$status)[osData$index %in% w ], time=365)$mean.error,
OS1000 = EvalAbsolutePred(p1000[osData$index %in% w ], Surv(osData$time1, osData$time2, osData$status)[osData$index %in% w ], time=1000)$mean.error
))
return(err)
}, simplify='array')
})
summary(t(absoluteErrorsCIRcv[[2]][,1,]))
## CIRrfx PRSrfx NRMrfx OSrfx OS365 OS1000
## Min. :0.03669 Min. :0.03836 Min. :0.01515 Min. :0.03073 Min. :0.02095 Min. :0.03204
## 1st Qu.:0.04551 1st Qu.:0.05974 1st Qu.:0.02360 1st Qu.:0.04013 1st Qu.:0.03136 1st Qu.:0.04235
## Median :0.04949 Median :0.06497 Median :0.02768 Median :0.04354 Median :0.03477 Median :0.04800
## Mean :0.05081 Mean :0.06524 Mean :0.02781 Mean :0.04377 Mean :0.03576 Mean :0.04801
## 3rd Qu.:0.05629 3rd Qu.:0.07118 3rd Qu.:0.03175 3rd Qu.:0.04767 3rd Qu.:0.03938 3rd Qu.:0.05295
## Max. :0.06849 Max. :0.08820 Max. :0.04339 Max. :0.05674 Max. :0.05117 Max. :0.07093
boxplot(t(absoluteErrorsCIRcv[[2]][,1,]), main="Training")
summary(t(absoluteErrorsCIRcv[[2]][,2,]))
## CIRrfx PRSrfx NRMrfx OSrfx OS365 OS1000
## Min. :0.04373 Min. :0.07699 Min. :0.01895 Min. :0.04045 Min. :0.02840 Min. :0.05755
## 1st Qu.:0.07425 1st Qu.:0.10945 1st Qu.:0.03944 1st Qu.:0.06770 1st Qu.:0.05942 1st Qu.:0.08178
## Median :0.08979 Median :0.12459 Median :0.04902 Median :0.08073 Median :0.07175 Median :0.09328
## Mean :0.08967 Mean :0.12564 Mean :0.04901 Mean :0.07998 Mean :0.07166 Mean :0.09584
## 3rd Qu.:0.10427 3rd Qu.:0.13970 3rd Qu.:0.05801 3rd Qu.:0.09100 3rd Qu.:0.08469 3rd Qu.:0.10879
## Max. :0.15110 Max. :0.19326 Max. :0.08479 Max. :0.12246 Max. :0.12315 Max. :0.14889
boxplot(t(absoluteErrorsCIRcv[[2]][,2,]), main="Test")
riskCol=set1[c(1,3,4,2)]
names(riskCol) <- levels(clinicalData$M_Risk)
i <- 1
rsStatus <- osData$status
rsStatus[osData$index %in% nrdData$index[nrdData$status==1]] <- 0
par(mfrow=c(2,2), mar=c(3,3,1,1), bty="n", mgp=c(2,.5,0))
for(l in levels(clinicalData$M_Risk)[c(2,4,3,1)]){
plot(NA,NA, ylim=c(0,1), xlab="Years", ylab="Mortality", xlim=c(0,10), yaxs='i', xaxs='i')
abline(h=seq(0.2,0.8,0.2),lty=1, col='lightgrey')
#abline(v=seq(1,9), col='lightgrey')
lines(survfit(Surv(time1/365, time2/365, status) ~ clinicalData$M_Risk[osData$index], data=osData, subset=clinicalData$M_Risk[osData$index]==l), col=riskCol[l],fun=function(x) 1-x ,mark=NA, lty=1, conf.int=FALSE)
rsKM <- survfit(Surv(time1/365, time2/365, rsStatus) ~ 1, data=osData, subset= clinicalData$M_Risk[osData$index]==l)
nrsKM <- survfit(Surv(time1/365, time2/365, status) ~ 1, data=nrdData, subset= clinicalData$M_Risk[nrdData$index]==l)
rsCR <- cumsum(c(1,diff(rsKM$surv)) * splinefun(nrsKM$time, nrsKM$surv, method="monoH.FC")(rsKM$time))
nrsCR <- cumsum(c(1,diff(nrsKM$surv)) * splinefun(rsKM$time, rsKM$surv, method="monoH.FC")(nrsKM$time))
lines(rsKM$time, 1-rsCR, col=riskCol[l], lty=2, type='s')
lines(nrsKM$time, 1-nrsCR, col=riskCol[l], lty=3, type='s')
if(i ==1)
legend(ifelse(i<=3,"topleft","bottomright"), c("total","relapse","non-rel"), lty=c(1,2,3), col="black", box.lty = 0, bg="white")
i <- i+1
mtext(l, side=3, font=2)
}
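The adjustment applied in the loop above can be reduced to a few lines on made-up numbers: each Kaplan-Meier decrement for one cause is weighted by the competing cause's survival at that time before the decrements are accumulated.

```r
## Toy competing-risk adjustment (hypothetical curves): mirrors the
## cumsum(c(1, diff(...)) * ...) construction used for rsCR and nrsCR above.
kmEvent <- c(1.00, 0.90, 0.75, 0.60)      # cause-specific KM curve
kmCompeting <- c(1.00, 0.95, 0.90, 0.85)  # competing-cause KM curve
adjSurv <- cumsum(c(1, diff(kmEvent)) * kmCompeting)
round(1 - adjSurv, 4)  # adjusted cumulative incidence: 0.0000 0.0950 0.2300 0.3575
```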
f <- function(x) 1-x
plot(survfit(Surv(time1/365, time2/365, status) ~ clinicalData$M_Risk[relData$index], data=relData), col=riskCol, ylab="CIR", xlab="Time after CR", main="Molecular risk groups, all cases", fun=f , ylim=c(0,1))
legend("bottomright", lty=1, bty="n", paste(levels(clinicalData$M_Risk), table(clinicalData$M_Risk[!is.na(c)])), col=riskCol)
Incidence of relapse versus risk tercile
par(mfrow=c(2,2), mar=c(3,3,1,1), bty="n", mgp=c(2,.5,0))
riskCirTD <- coxRFXRelTD$Z %*% coef(coxRFXRelTD) - relData$transplantCR1 * coef(coxRFXRelTD)["transplantCR1"]
quantileRiskCirTD <- numeric(nrow(relData))
for(l in levels(clinicalData$M_Risk)[c(2,4,3,1)]){
w <- which(clinicalData$M_Risk[relData$index]==l)
q <- cut(riskCirTD[w], quantile(riskCirTD[w], seq(0,1,.33)), include.lowest=TRUE, labels=c("T1","T2","T3"))
quantileRiskCirTD[w] <- q
plot(NA,NA, ylab="CIR", main=paste(l, "terciles"), xlab="Years after CR", ylim=c(0,1), xlim=c(0,10), xaxs="i", yaxs="i")
#abline(h=seq(0.2,0.8,0.2),lty=1, col='lightgrey')
fit <- survfit(Surv(time1/365, time2/365, status) ~ q + transplantCR1, data=relData[w,])
## adjust for competing risk (NRM)
i <- c(0,diff(fit$surv))
s <- split(fit$surv, cumsum(i>0)) # split into strata
u <- split(fit$upper, cumsum(i>0)) # split into strata
v <- split(fit$lower, cumsum(i>0)) # split into strata
t <- split(fit$time, cumsum(i>0))
nrsKM <- survfit(Surv(time1/365, time2/365, status) ~ 1, data=nrdData, subset= clinicalData$M_Risk[nrdData$index]==l)
fit$surv <- unlist(sapply(seq_along(s), function(i) cumsum(c(1,diff(s[[i]])) * splinefun(nrsKM$time, nrsKM$surv, method="monoH.FC")(t[[i]])))) #adjust increments by nrs KM est
fit$lower <- unlist(sapply(seq_along(s), function(i) cumsum(c(1,diff(v[[i]])) * splinefun(nrsKM$time, nrsKM$surv, method="monoH.FC")(t[[i]])))) #adjust increments by nrs KM est
fit$upper <- unlist(sapply(seq_along(s), function(i) cumsum(c(1,diff(u[[i]])) * splinefun(nrsKM$time, nrsKM$surv, method="monoH.FC")(t[[i]])))) #adjust increments by nrs KM est
lines(fit, col=rep(sapply(2:0,function(x) colTrans(riskCol[l],x)), each=2), lty=c(1,0), mark=NA, xlab="Time after CR", fun=f)
#legend("bottomright", lty=c(1,3), bty="n", c("no TPL","TPL"), col=riskCol[l])
}
We use the survival package to compute the following multi-state (mstate) fits of CIR and NRM.
t <- clinicalData$Recurrence_date
t[is.na(t)] <- clinicalData$Date_LF[is.na(t)]
time <- as.numeric(pmin(t, clinicalData$Date_LF) - clinicalData$CR_date)
status <- factor(ifelse(!is.na(clinicalData$Recurrence_date), "relapse", ifelse(clinicalData$Status==1,"dead","alive" )))
status[is.na(clinicalData$CR_date)] <- NA
alloCR1 <- 1:1540 %in% osData$index[osData$transplantCR1==1]
mSurv <- Surv(time/365.25, status, type="mstate")
f <- function(x) 1-x
par(mfrow=c(2,2), mar=c(3,3,1,1), bty="n", mgp=c(2,.5,0))
riskCir <- (coxRFXRelTD$Z %*% coef(coxRFXRelTD) - relData$transplantCR1 * coef(coxRFXRelTD)["transplantCR1"])[1:1540] # Risk w/o allograft
qtl <- numeric(nrow(dataFrame))
for(l in levels(clinicalData$M_Risk)[c(2,4,3,1)]){
w <- which(clinicalData$M_Risk==l)
q <- cut(riskCir[w], quantile(riskCir[w], seq(0,1,.33)), include.lowest=TRUE, labels=c("T1","T2","T3"))
qtl[w] <- q
plot(NA,NA, ylab="Fraction relapsed", main=paste0(l,", n=",sum(clinicalData$M_Risk[!is.na(mSurv)]==l, na.rm=TRUE)), xlab="Years after CR", ylim=c(0,1), xlim=c(0,5), xaxs="i", yaxs="i", font.main=1)
#abline(h=seq(0.2,0.8,0.2),lty=1, col='lightgrey')
fit <- survfit(mSurv~ qtl, subset= clinicalData$M_Risk==l)
lines(fit, col=sapply(2:0, function(x) c(colTrans(set1[2],x), colTrans(set1[5],x))), lty=c(1,1), mark=NA, xlab="Time after CR", fun=f)
#legend("bottomright", lty=c(1,3), bty="n", c("no TPL","TPL"), col=riskCol[l])
}
Overall survival after remission versus risk tercile
par(mfrow=c(2,2), mar=c(3,3,1,1), bty="n", mgp=c(2,.5,0))
riskOsCR <- coxRFXOsCR$Z %*% coef(coxRFXOsCR) - osData$transplantCR1 * coef(coxRFXOsCR)["transplantCR1"]
quantileRiskOsCR <- numeric(nrow(osData))
for(l in levels(clinicalData$M_Risk)[c(2,4,3,1)]){
w <- which(clinicalData$M_Risk[osData$index]==l)
q <- cut(riskOsCR[w], quantile(riskOsCR[w], seq(0,1,.33)), include.lowest=TRUE, labels=c("T1","T2","T3"))
quantileRiskOsCR[w] <- q
plot(NA,NA, ylab="OS", main=paste(l, "terciles"), xlab="Years after CR", ylim=c(0,1), xlim=c(0,10), xaxs="i", yaxs="i")
abline(h=seq(0.2,0.8,0.2),lty=1, col='lightgrey')
fit <- survfit(Surv(time1/365, time2/365, status) ~ q + transplantCR1, data=osData[w,])
lines(fit, col=rep(sapply(2:0,function(x) colTrans(riskCol[l],x)), each=2), lty=c(1,3), mark=NA, xlab="Time after CR")
legend("bottomright", lty=c(1,3), bty="n", c("no TPL","TPL"), col=riskCol[l])
}
p <- lapply(levels(clinicalData$M_Risk), function(l) {
w <- which(clinicalData$M_Risk==l)
q <- cut(riskOsCR[w], quantile(riskOsCR[w], seq(0,1,.33)), include.lowest=TRUE, labels=c("T1","T2","T3"))
sapply(split(relData[w, names(crGroups)], q), colMeans)
})
names(p) <- levels(clinicalData$M_Risk)
par(mfrow=c(4,1), xpd=NA)
for(l in levels(clinicalData$M_Risk)[c(2,4,3,1)]){
t <- t((p[[l]]) * coef(coxRFXRelTD))[,crGroups != "Treatment"]
z <- (coef(coxRFXRelTD)/sqrt(diag(coxRFXRelTD$var2)))[crGroups != "Treatment"]
o <- order(z)
w <- c(1:15,ncol(t)-14:0)
b <- barplot(t[,o][,w], las=2, col=sapply(2:0,function(x) colTrans(riskCol[l],x)), beside=TRUE, ylim=c(-.5,.5), names.arg=rep("", length(w)))
rotatedLabel(b[2,],pmin(0,apply(t,2,min)[o][w]), colnames(t)[o][w])
s <- matrix(rep(sqrt(diag(coxRFXRelTD$var2)[crGroups != "Treatment"]), each=3) * t/rep(coef(coxRFXRelTD)[crGroups != "Treatment"], each=3), nrow=3)[,o][,w]
segments(b[1,], (colMeans(coxRFXRelTD$Z)*coef(coxRFXRelTD))[crGroups != "Treatment"][o][w] ,b[3,], (colMeans(coxRFXRelTD$Z)*coef(coxRFXRelTD))[crGroups != "Treatment"][o][w])
segments(b,t[,o][,w]-s, b,t[,o][,w]+s)
}
#p <- as.data.frame(PartialRisk(coxRFXRelTD)[1:nrow(clinicalData),])
partialRiskOsCR <- as.data.frame(PartialRisk(coxRFXOsCR)[1:nrow(clinicalData),])
s <- do.call("rbind",lapply(levels(clinicalData$M_Risk)[c(2,4,3,1)], function(l) {
w <- which(clinicalData$M_Risk==l)
q <- cut(riskOsCR[w], quantile(riskOsCR[w], seq(0,1,.33)), include.lowest=TRUE, labels=c("T1","T2","T3"))
t(sapply(split(partialRiskOsCR[w, ], q), colMeans) +.5 - colMeans(partialRiskOsCR))
}))
Risk constellation for OS after remission
c <- sapply(2:0, function(t) sapply(riskCol[c(2,4,3,1)], function(c) colTrans(c,t)))
g <- expand.grid(1:3,1:4-1)*3
stars(2*s[,c("Clinical","Demographics","Genetics","GeneGene","CNA","Fusions","Treatment")], scale=FALSE, col.stars=t(c), key.loc = c(13,0), locations=g, labels=NA)
symbols(g[,1], g[,2], circles=rep(1,12), inches=FALSE, add=TRUE)
text(1, 0:3*3, names(riskCol[c(2,4,3,1)]), pos=2)
text(1:3*3, 11, c("Best","Intermediate","Worst"), pos=3)
Prototypical risk constellations
prototypes <- sapply(levels(clinicalData$M_Risk)[c(2,4,3,1)], function(l) sapply(1:3, function(i){
#d <- dist(as.data.frame(coxRFXRelTD$Z[which(clinicalData$M_Risk[cirData$index]==l & quantileRiskCirTD==i &! is.na(clinicalData$CR_date[cirData$index])), ]))
w <- which(clinicalData$M_Risk[relData$index]==l & quantileRiskOsCR==i &! is.na(clinicalData$CR_date[relData$index]))
d <- dist(t(t(coxRFXOsCR$Z[w, ]) ))
osData$index[w][which.min(rowMeans(as.matrix(d), na.rm=TRUE))]
}))
c <- sapply(2:0, function(t) sapply(riskCol[c(2,4,3,1)], function(c) colTrans(c,t)))
g <- expand.grid(1:3,1:4-1)*3
stars(2*t(t(partialRiskOsCR[prototypes,])- colMeans(partialRiskOsCR))[,c("Clinical","Demographics","Genetics","CNA","Fusions","Treatment")] +1, scale=FALSE, col.stars=t(c), key.loc = c(13,0), locations=g, labels=NA)
symbols(g[,1], g[,2], circles=rep(1,12), inches=FALSE, add=TRUE)
text(1, 0:3*3, names(riskCol[c(2,4,3,1)]), pos=2)
text(1:3*3, 11, c("Low","Intermediate","High"), pos=3)
s <- partialRiskOsCR - rep(colMeans(partialRiskOsCR), each=nrow(partialRiskOsCR))
w <- sapply(split(1:1540, paste(clinicalData$M_Risk, quantileRiskOsCR[1:1540])), `[`, 1:12)
w <- w[,!grepl("NA", colnames(w))][,c(4:6,10:12,7:9,1:3)]
l <- stars(s[w,c("Demographics","Treatment","Fusions","CNA","Genetics","GeneGene","Clinical")] + .5, scale=FALSE, col.stars = mapply(function(i,j) {t <- try(c[i,j]); if(class(t)=="try-error") NA else t}, as.character(clinicalData$M_Risk[w]),quantileRiskCirTD[w]), labels="")
symbols(l[,1],l[,2], circles=rep(0.5, nrow(l)), inches=FALSE,add=TRUE)
layout(matrix(c(1:4), ncol=2),heights = c(10,1), widths = c(10,1))
partialRiskCirTD <- as.data.frame(PartialRisk(coxRFXRelTD))
s <- partialRiskCirTD[1:nrow(clinicalData),] - rep(colMeans(partialRiskCirTD), each=nrow(clinicalData))
u <- unique(relData$index[!is.na(relData$time2)])
w <- sapply(split(u, paste(clinicalData$M_Risk, quantileRiskCirTD[1:1540])[u]), `[`, 1:12)
w <- w[,!grepl("NA", colnames(w))][,c(4:6,10:12,7:9,1:3)]
i <- which(rev(!duplicated(rev(relData$index))))
m <- i[order(relData$index[i])]
c <- cut(relData$time2, quantile(relData$time2[m], seq(0,1,0.1), na.rm=TRUE))
l <- mg14:::stars(s[w,c("Demographics","Treatment","Fusions","CNA","Genetics","GeneGene","Clinical")] + .5, scale=FALSE, col.stars = brewer.pal(11,"RdBu")[-6][c[w]], labels="", density=ifelse(relData$status[m][w]==1,NA,48), col.lines=rep(1,(12^2)))
symbols(l[,1],l[,2], circles=rep(0.5, nrow(l)), inches=FALSE,add=TRUE, fg='lightgrey')
par(mar=c(2,2,0,2))
barplot(matrix(diff(quantile(relData$time2[m], na.rm=T, seq(0,1,0.1))), ncol=1)/365.25, col=brewer.pal(11,"RdBu")[-6], horiz=TRUE, border=NA, xlim=c(0,20))
Create a data.frame with all allograft possibilities: none, in CR1, and after relapse.
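allDataTpl is used below but not constructed in this chunk; the following is a sketch (function name and details are ours) of how such a data frame could be built from a per-patient data frame carrying the transplantCR1 and transplantRel indicator columns.

```r
## Sketch: replicate each patient's covariate row three times and set the
## transplant indicators to none / CR1 / after relapse, matching the
## byrow=TRUE, ncol=3 reshaping of the predictions below.
buildAllDataTpl <- function(allData){
  tpl <- allData[rep(seq_len(nrow(allData)), each=3), ]
  tpl$transplantCR1 <- rep(c(0, 1, 0), nrow(allData))  # allograft in CR1 in slot 2
  tpl$transplantRel <- rep(c(0, 0, 1), nrow(allData))  # allograft after relapse in slot 3
  tpl
}
```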
w <- sort(unique(osData$index[which(quantileRiskOsCR==3 & clinicalData$M_Risk[osData$index]=="Favorable")]))
multiRFX3Tpl <- MultiRFX3(coxRFXNrdTD, coxRFXRelTD, coxRFXPrdTD, data=allDataTpl, x=3*365, prdData=prdData)
multiRFX3Tpl <- as.data.frame(matrix(multiRFX3Tpl$os, ncol=3, byrow=TRUE, dimnames=list(NULL, c("None","CR1","Relapse"))), row.names=rownames(dataFrame))
survivalTpl <- data.frame(multiRFX3Tpl, os=osYr, age=clinicalData$AOD, ELN=clinicalData$M_Risk, tercile=quantileRiskOsCR[1:nrow(multiRFX3Tpl)])
datatable(format(survivalTpl[order(survivalTpl$CR1 -survivalTpl$Relapse),], digits=4))
datatable(multiRFX3Tpl[patients,])
Function to predict OS from the relapse (CIR), PRS and NRM models. It also computes confidence intervals for each type of allograft and for the predicted differences in outcome between allograft types.
MultiRFX3TplCi <- function(coxRFXNrdTD, coxRFXRelTD, coxRFXPrdTD, data, x =365, prdData, ciType="simulated", nSim = 200, mc.cores=10){
## Step 1: Compute KM survival curves and log hazard
getS <- function(coxRFX, data, max.x=5000) {
if(!is.null(coxRFX$na.action)) coxRFX$Z <- coxRFX$Z[-coxRFX$na.action,]
data <- as.matrix(data[,match(colnames(coxRFX$Z),colnames(data))])
r <- PredictRiskMissing(coxRFX, data, var="var2")
H0 <- basehaz(coxRFX, centered = FALSE)
hazardDist <- splinefun(H0$time, H0$hazard, method="monoH.FC")
x <- c(0:max.x,max.x)
S <- exp(-hazardDist(x))
return(list(S=S, r=r, x=x, hazardDist=hazardDist, r0 = coxRFX$means %*% coef(coxRFX)))
}
data$transplantCR1 <- 0
data$transplantRel <- 0
kmCir <- getS(coxRFX = coxRFXRelTD, data = data, max.x=max(x))
kmNrs <- getS(coxRFX = coxRFXNrdTD, data = data, max.x=max(x))
kmPrs <- getS(coxRFX = coxRFXPrdTD, data = data, max.x=max(x))
survPredict <- function(surv){
s <- survfit(surv~1)
splinefun(s$time, s$surv, method="monoH.FC")
}
xx <- 0:max(x)
# Baseline Prs (measured from relapse)
kmPrs0 <- survPredict(Surv(prdData$time1, prdData$time2, prdData$status))(xx)
# PRS baseline with spline-based dependence on the duration of CR
coxphPrs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=prdData)
tdPrmBaseline <- exp(predict(coxphPrs, newdata=data.frame(time0=xx[-1])))
stopifnot(length(x)==1 | length(x) == nrow(data))
if(length(x)==nrow(data))
w <- match(x,kmCir$x)
else if(length(x)==1)
w <- rep(match(x, kmCir$x), nrow(data))
survival <- sapply(c("None","Rel","CR1"), function(type){
if(type=="None"){
data$transplantCR1 <- 0
data$transplantRel <- 0
}else if(type=="Rel"){
data$transplantCR1 <- 0
data$transplantRel <- 1
}else if(type=="CR1"){
data$transplantCR1 <- 1
data$transplantRel <- 0
}
kmCir <- getS(coxRFX = coxRFXRelTD, data = data, max.x=max(x))
kmNrm <- getS(coxRFX = coxRFXNrdTD, data = data, max.x=max(x))
kmPrs <- getS(coxRFX = coxRFXPrdTD, data = data, max.x=max(x))
## Step 2: Adjust the CIR and NRM curves for competing risks, accounting for the individual hazards
kmCir$Sadj <- sapply(1:nrow(data), function(i) cumsum(c(1,diff(kmCir$S^exp(kmCir$r[i,1]))) * kmNrm$S ^ exp(kmNrm$r[i,1])))
kmNrm$Sadj <- sapply(1:nrow(data), function(i) cumsum(c(1,diff(kmNrm$S^exp(kmNrm$r[i,1]))) * kmCir$S ^ exp(kmCir$r[i,1]))) ## array times x nrow(data)
y <- mapply(function(i,j) kmNrm$Sadj[i,j], w,1:length(w) ) # select time for each sample
nrs <- y
nrsUp <- y^exp(2*sqrt(kmNrm$r[,2]))
nrsLo <- y^exp(- 2*sqrt(kmNrm$r[,2]))
y <- mapply(function(i,j) kmCir$Sadj[i,j], w,1:length(w) ) # select time for each sample
cir <- y
cirLo <- y^exp( 2*sqrt(kmCir$r[,2]))
cirUp <- y^exp( - 2*sqrt(kmCir$r[,2]))
## Step 3: Compute post-relapse survival
rs <- sapply(1:nrow(data), function(i){
### Different approach
xLen <- 1+floor(x)
cir <- kmCir$Sadj[1:xLen,i]
rs <- computeTotalPrsC(x = xx, diffCir = diff(cir), prsP = kmPrs0, tdPrmBaseline = tdPrmBaseline, risk = kmPrs$r[i,1]-kmPrs$r0)
rs[xLen]
})
## Step 4: Combine into overall survival
if(any(1-(1-rs)-(1-nrs)<0)) warning("OS < 0 occurred.")
os <- pmax(pmin(1-(1-rs)-(1-nrs),1),0)
cbind(os, rs, nrs, aar=rs-cir)
}, simplify='array')
## Step 5: Confidence intervals for OS
osCi <- sapply(mclapply(1:nrow(data), function(i){
{
## Simulate CI
osCiMc <- sapply(1:nSim, function(foo){
r0 <- rnorm(3,c(kmCir$r[i,1],kmNrs$r[i,1],kmPrs$r[i,1]),sqrt(c(kmCir$r[i,2],kmNrs$r[i,2],kmPrs$r[i,2])))
H0 <- exp(r0)
nrs0 <- cumsum(c(1,diff(kmNrs$S^H0[2])) * kmCir$S^H0[1]) ## Correct KM estimate for competing risk
diffCir <- diff(c(1,kmCir$S^H0[1])) * kmNrs$S^H0[2] ## Correct KM estimate for competing risk
cir0 <- 1+cumsum(diffCir)
rs0 <- computeTotalPrsC(x = xx, diffCir = diffCir, prsP = kmPrs0, tdPrmBaseline = tdPrmBaseline, risk = -kmPrs$r0+log(H0[3]))
aar0 <- rs0[1:w[i]]-cir0[1:w[i]]
Hcr1 <- exp(r0 + rnorm(3,c(coxRFXRelTD$coefficients["transplantCR1"],coxRFXNrdTD$coefficients["transplantCR1"],coxRFXPrdTD$coefficients["transplantCR1"]),
sqrt(c(coxRFXRelTD$var2["transplantCR1","transplantCR1"],coxRFXNrdTD$var2["transplantCR1","transplantCR1"],coxRFXPrdTD$var2["transplantCR1","transplantCR1"]))))
nrsCr1 <- cumsum(c(1,diff(kmNrs$S^Hcr1[2])) * kmCir$S^Hcr1[1]) ## Correct KM estimate for competing risk
diffCir <- diff(c(1,kmCir$S^Hcr1[1])) * kmNrs$S^Hcr1[2] ## Correct KM estimate for competing risk
cirCr1 <- 1+cumsum(diffCir)
rsCr1 <- computeTotalPrsC(x = xx, diffCir = diffCir, prsP = kmPrs0, tdPrmBaseline = tdPrmBaseline, risk = -kmPrs$r0+log(Hcr1[3]))
aarCr1 <- rsCr1[1:w[i]]-cirCr1[1:w[i]]
Hrel <- exp(r0 + rnorm(3,c(coxRFXRelTD$coefficients["transplantRel"],coxRFXNrdTD$coefficients["transplantRel"],coxRFXPrdTD$coefficients["transplantRel"]),
sqrt(c(coxRFXRelTD$var2["transplantRel","transplantRel"],coxRFXNrdTD$var2["transplantRel","transplantRel"],coxRFXPrdTD$var2["transplantRel","transplantRel"]))))
nrsRel <- cumsum(c(1,diff(kmNrs$S^Hrel[2])) * kmCir$S^Hrel[1]) ## Correct KM estimate for competing risk
diffCir <- diff(c(1,kmCir$S^Hrel[1])) * kmNrs$S^Hrel[2] ## Correct KM estimate for competing risk
cirRel <- 1+cumsum(diffCir)
rsRel <- computeTotalPrsC(x = xx, diffCir = diffCir, prsP = kmPrs0, tdPrmBaseline = tdPrmBaseline, risk = -kmPrs$r0+log(Hrel[3]))
aarRel <- rsRel[1:w[i]]-cirRel[1:w[i]]
os0 <- (1-(1-nrs0[1:w[i]])-(1-rs0))[w[i]]
osCr1 <- (1-(1-nrsCr1[1:w[i]])-(1-rsCr1))[w[i]]
osRel <- (1-(1-nrsRel[1:w[i]])-(1-rsRel))[w[i]]
return(cbind(os=c(none=os0, cr1=osCr1, rel=osRel, dCr1=osCr1-os0, dRel=osRel-os0, dCr1Rel=osCr1-osRel),
rs=c(none=rs0[w[i]], cr1=rsCr1[w[i]], rel=rsRel[w[i]], dCr1=rsCr1[w[i]]-rs0[w[i]], dRel=rsRel[w[i]]-rs0[w[i]], dCr1Rel=rsCr1[w[i]]-rsRel[w[i]]),
nrs=c(none=nrs0[w[i]], cr1=nrsCr1[w[i]], rel=nrsRel[w[i]], dCr1=nrsCr1[w[i]]-nrs0[w[i]], dRel=nrsRel[w[i]]-nrs0[w[i]], dCr1Rel=nrsCr1[w[i]]-nrsRel[w[i]]),
aar=c(none=aar0[w[i]], cr1=aarCr1[w[i]], rel=aarRel[w[i]], dCr1=aarCr1[w[i]]-aar0[w[i]], dRel=aarRel[w[i]]-aar0[w[i]], dCr1Rel=aarCr1[w[i]]-aarRel[w[i]])))
}, simplify='array')
osCiMcQ <- apply(osCiMc,1:2,quantile, c(0.025,0.5,0.975))
return(sapply(c("os","rs","nrs","aar"), function(t)
cbind(hat = c(survival[i,t,1], survival[i,t,3], survival[i,t,2], survival[i,t,3]-survival[i,t,1], survival[i,t,2]-survival[i,t,1], survival[i,t,3]-survival[i,t,2]),
median = osCiMcQ[2,,t], lower = osCiMcQ[1,,t], upper = osCiMcQ[3,,t]), simplify="array"))
}
}, mc.cores=mc.cores), I, simplify="array")
#cat(os, "\n")
return(osCi)
}
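The confidence-interval machinery above reduces to a simple pattern, sketched here with made-up numbers: draw the log relative risk from its Gaussian approximation, push each draw through the survival transform, and read off empirical quantiles.

```r
## Simulated-CI sketch (hypothetical values), cf. osCiMc/osCiMcQ above.
set.seed(1)
S0 <- 0.8                      # baseline survival at the prediction time
rHat <- 0.3; rVar <- 0.04      # log relative risk and its variance
sim <- S0 ^ exp(rnorm(1000, rHat, sqrt(rVar)))
ci <- quantile(sim, c(0.025, 0.5, 0.975))  # lower / median / upper
ci
```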
set.seed(42)
d <- osData[1:nrow(dataFrame),]
d$transplantCR1 <- 0
d$transplantRel <- 0
p <- grep("PD11104a|PD8314a|PD10941a",rownames(dataFrame))
predict3 <- MultiRFX3TplCi(coxRFXNrdTD, coxRFXRelTD, coxRFXPrdTD, data=d[p,colnames(coxRFXNrdTD$Z)], x=3*365, nSim=1000, prdData=prdData) ## selected with 1000
dimnames(predict3)[[4]] <- rownames(dataFrame)[p]
predict3
## , , os, PD10941a
##
## hat median lower upper
## none 0.818663673 0.813758093 0.735781209 0.87846475
## cr1 0.837567294 0.833837913 0.728221120 0.89613925
## rel 0.841335270 0.837729406 0.758152717 0.89684242
## dCr1 0.018903622 0.016563073 -0.043991177 0.06893157
## dRel 0.022671597 0.021987031 -0.006177209 0.05224429
## dCr1Rel -0.003767975 -0.005622502 -0.067644137 0.04502109
##
## , , rs, PD10941a
##
## hat median lower upper
## none 0.88114586 0.88071728 0.8028936741 0.93390699
## cr1 0.94463616 0.94507091 0.8999107697 0.97099932
## rel 0.90381746 0.90559639 0.8318637923 0.95154789
## dCr1 0.06349030 0.06276975 0.0330569278 0.10630881
## dRel 0.02267160 0.02226157 0.0002740095 0.05254428
## dCr1Rel 0.04081871 0.03892768 0.0112503418 0.08418140
##
## , , nrs, PD10941a
##
## hat median lower upper
## none 0.93751781 0.9377259143 0.87950077 0.96807394
## cr1 0.89293113 0.8921954102 0.79052095 0.94496498
## rel 0.93751781 0.9369624377 0.87673751 0.96928132
## dCr1 -0.04458668 -0.0447622293 -0.10283944 -0.01614686
## dRel 0.00000000 -0.0003878613 -0.01799561 0.01482152
## dCr1Rel -0.04458668 -0.0439414652 -0.10433321 -0.01148903
##
## , , aar, PD10941a
##
## hat median lower upper
## none 0.16778126 0.16602329 0.098569191 0.26404124
## cr1 0.05952954 0.05832456 0.032368604 0.10513500
## rel 0.19045286 0.18806412 0.114633337 0.30149402
## dCr1 -0.10825172 -0.10691238 -0.163826806 -0.06454859
## dRel 0.02267160 0.02286218 -0.004982565 0.05963474
## dCr1Rel -0.13092332 -0.12981706 -0.207085090 -0.07409337
##
## , , os, PD11104a
##
## hat median lower upper
## none 0.84056298 0.83639387 0.752890309 0.89841989
## cr1 0.88621933 0.88221026 0.800876178 0.93160969
## rel 0.86553736 0.86095632 0.779960110 0.91781550
## dCr1 0.04565635 0.04427444 -0.005897815 0.09407140
## dRel 0.02497437 0.02388986 -0.002510175 0.05296898
## dCr1Rel 0.02068197 0.01937872 -0.029100606 0.07237330
##
## , , rs, PD11104a
##
## hat median lower upper
## none 0.87009314 0.87005765 0.7878001759 0.92237900
## cr1 0.93795534 0.93869458 0.8918465140 0.96652262
## rel 0.89506752 0.89493997 0.8132439573 0.94137070
## dCr1 0.06786219 0.06762730 0.0365378460 0.11655265
## dRel 0.02497437 0.02414719 -0.0007901814 0.05285589
## dCr1Rel 0.04288782 0.04308201 0.0133882399 0.09190801
##
## , , nrs, PD11104a
##
## hat median lower upper
## none 0.97046984 0.9698761766 0.92402463 0.988322758
## cr1 0.94826399 0.9474848018 0.86036996 0.979862354
## rel 0.97046984 0.9690673487 0.91721959 0.988712274
## dCr1 -0.02220585 -0.0224939417 -0.06600414 -0.006582919
## dRel 0.00000000 -0.0001226097 -0.01089082 0.009718243
## dCr1Rel -0.02220585 -0.0216192196 -0.06747818 -0.005121411
##
## , , aar, PD11104a
##
## hat median lower upper
## none 0.19080344 0.18803262 0.117200743 0.28859824
## cr1 0.06981669 0.06730478 0.036730974 0.12009207
## rel 0.21577782 0.21368379 0.136124880 0.32761145
## dCr1 -0.12098675 -0.11886572 -0.178730740 -0.07408171
## dRel 0.02497437 0.02483472 -0.004621345 0.06238499
## dCr1Rel -0.14596112 -0.14404519 -0.219428528 -0.09029031
##
## , , os, PD8314a
##
## hat median lower upper
## none 0.48820503 0.4881795 0.346459914 0.6095472
## cr1 0.67849561 0.6772427 0.551822750 0.7668568
## rel 0.54208498 0.5424903 0.395004579 0.6651239
## dCr1 0.19029058 0.1856732 0.101625028 0.2656564
## dRel 0.05387995 0.0522543 -0.006530565 0.1162451
## dCr1Rel 0.13641063 0.1321234 0.036290833 0.2415983
##
## , , rs, PD8314a
##
## hat median lower upper
## none 0.53198035 0.53396623 0.391145882 0.6595976
## cr1 0.76445697 0.76880775 0.658094276 0.8472067
## rel 0.58586030 0.58956092 0.437185335 0.7242483
## dCr1 0.23247662 0.22964120 0.159395090 0.3104094
## dRel 0.05387995 0.05361889 -0.007811998 0.1136675
## dCr1Rel 0.17859667 0.17710282 0.086782790 0.2753473
##
## , , nrs, PD8314a
##
## hat median lower upper
## none 0.95622468 0.9564742160 0.91389070 0.97815004
## cr1 0.91403863 0.9143142553 0.82517813 0.95636212
## rel 0.95622468 0.9559808647 0.91109939 0.97855350
## dCr1 -0.04218604 -0.0418897124 -0.09626402 -0.01503247
## dRel 0.00000000 -0.0001838208 -0.01385361 0.01141869
## dCr1Rel -0.04218604 -0.0416301733 -0.09778504 -0.01323211
##
## , , aar, PD8314a
##
## hat median lower upper
## none 0.13298991 0.13096528 0.061187235 0.22304131
## cr1 0.04916117 0.04820070 0.020633949 0.10185429
## rel 0.18686985 0.18489281 0.089794282 0.30353042
## dCr1 -0.08382874 -0.08014001 -0.141322138 -0.03631045
## dRel 0.05387995 0.05032049 0.008146389 0.10664127
## dCr1Rel -0.13770869 -0.13275091 -0.218742006 -0.06246168
set.seed(42)
multiRFX3TplCi <- MultiRFX3TplCi(coxRFXNrdTD, coxRFXRelTD, coxRFXPrdTD, data=d[,colnames(coxRFXNrdTD$Z)], x=3*365, nSim=200, prdData=prdData) ## others with 200
dimnames(multiRFX3TplCi)[[4]] <- rownames(dataFrame)
The following shows boxplots of the mortality reduction v the risk terciles.
par(mar=c(7,5,1,1))
f <- factor(clinicalData$M_Risk, levels=levels(clinicalData$M_Risk)[c(2,4,3,1)])
boxplot(multiRFX3Tpl$CR1 - multiRFX3Tpl$None ~ quantileRiskOsCR[1:1540] + f, las=2, col=t(outer(riskCol[c(2,4,3,1)], 2:0, colTrans)), ylab="Survival gain TPL CR1 at 3yr")
boxplot(multiRFX3Tpl$Relapse - multiRFX3Tpl$None ~ quantileRiskOsCR[1:1540] + f, las=2, col=t(outer(riskCol[c(2,4,3,1)], 2:0, colTrans)), ylab="Survival gain TPL Relapse at 3yr")
boxplot(multiRFX3Tpl$CR1 - multiRFX3Tpl$Relapse ~ quantileRiskOsCR[1:1540] + f, las=2, col=t(outer(riskCol[c(2,4,3,1)], 2:0, colTrans)), ylab="Survival gain TPL in CR1 over salvage at 3yr")
abline(h=0)
Mortality reduction v age
par(mar=c(3,3,1,1))
y <- multiRFX3Tpl$CR1 - multiRFX3Tpl$None
x <- dataFrame$AOD_10*10
plot(y ~ x)
lines(lowess(x[x<60], y[x<60]), col="green")
Note: the jump after age 60 arises because patients older than 60 in trial AMLHD98B did not receive allografts. Based on the trial stratum they are hence (incorrectly) predicted to have very low non-relapse mortality upon allograft. This does not, however, affect novel patients.
plot(multiRFX3Tpl$CR1 - multiRFX3Tpl$None ~ predict(coxRFXOsCR, newdata=osData[1:1540,]), xlab="Risk", ylab="Survival gain TPL CR1 at 3yr")
lines(lowess(predict(coxRFXOsCR, newdata=osData[1:1540,]), multiRFX3Tpl$CR1 - multiRFX3Tpl$None), col='green')
We compute LOO out-of-sample predictions for the survival gain by allograft in CR1 v relapse by training 1540 models on 1539 patients each.
multiRFX3TplCiLoo <- sapply(mclapply(rownames(dataFrame), function(pd){
e <- new.env()
i <- which(rownames(dataFrame)==pd)
whichTrain <<- which(rownames(dataFrame)!=pd)
load(paste0("../code/loo/",i,".RData"), env=e)
multiRFX3TplCi <- MultiRFX3TplCi(e$rfxNrs, e$rfxRel, e$rfxPrs, data=data[i,colnames(e$rfxPrs$Z), drop=FALSE], x=3*365, nSim=200, prdData=prdData[prdData$index!=i,], mc.cores=1)
}, mc.cores=10), I, simplify="array")[,,,1,]
This we compare to in-sample predictions of the model trained on all 1540 patients.
plot(multiRFX3TplCiLoo["dCr1Rel","hat","os",],multiRFX3TplCi["dCr1Rel","hat","os",] )
cor(multiRFX3TplCiLoo["dCr1Rel","hat","os",], multiRFX3TplCi["dCr1Rel","hat","os",] )
## [1] 0.9815685
multiRFX3TplLoo <- t(multiRFX3TplCiLoo[1:3,"hat","os",])
The correlation of in-sample and out-of-sample predictions is very high, but there are some differences. We now assess the accuracy of our predictions by comparing the observed survival with the out-of-sample prediction. To this end, we split out the quarter of patients predicted to benefit the most. In both subsets we compare the observed 3yr survival between patients with and without allograft in CR1 and compute the difference; CIs are obtained by bootstrapping.
d <- multiRFX3TplLoo[,2]-multiRFX3TplLoo[,3]
w <- which(clinicalData$AOD < 60)
q <- c(min(d), 0.1, max(d))
c <- cut(d, breaks=q, include.lowest=TRUE)# , paste0("[",names(q)[-length(q)],",",names(q)[-1],")"))
e <- sapply(levels(c),
function(cc) {
t <- try(survfit(Surv(time1, time2, status) ~ transplantCR1, data=osData, subset=c[osData$index]==cc & osData$index %in% w));
if(class(t)=="try-error")
rep(NA,2)
else {
s <- summary(t, time=3*365)
if(length(s$surv)==2) {
ci <- sapply(1:200, function(foo){
set.seed(foo)
b <- sample(1:nrow(dataFrame), replace=TRUE)
diff(summary(survfit(Surv(time1, time2, status) ~ transplantCR1, data=osData, subset=c[osData$index]==cc & osData$index %in% w & osData$index %in% b), time=3*365)$surv)
})
r <- c(diff(s$surv), quantile(ci, c(0.025, 0.975)))
names(r) <- c("delta", "lower",'upper')
r
}
else rep(NA,3)
}})
x <- sapply(split(d[w],c[w]),median)
par(xpd=NA, bty="L")
plot(x,e[1,], pch=19, xlim=c(-.05,.2), ylim=c(-.05,.2), xlab = "Predicted survival benefit", ylab="Observed survival benefit (leave-one-out CV)")
h <- density(d[w])
y <- h$y/diff(range(h$y))*.05 + par("usr")[3]
v <- h$x <= q[2]
par(xpd=FALSE)
polygon(c(h$x[v], h$x[which(v)[length(which(v))]]), c(y[v],par("usr")[3]), border=NA, col=set1[1])
polygon(c(h$x[which(!v)[1]], h$x[!v]), c(par("usr")[3],y[!v]), border=NA, col=set1[3])
lines(h$x, y)
segments(x,e[2,],x,e[3,])
#rug(d, col="#00000022")
abline(0,1, lty=3)
We can explore the hypothetical survival gain if each patient had received the optimal treatment strategy. First, the observed outcome for patients < 60yr with CR:
summary(survfit(Surv(time1, time2, status) ~ 1, data=osData, subset=AOD_10 < 6), time=3*365)
## Call: survfit(formula = Surv(time1, time2, status) ~ 1, data = osData,
## subset = AOD_10 < 6)
##
## 176 observations deleted due to missingness
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1095 605 455 0.584 0.0149 0.555 0.614
Compare this to the model predictions, as treated:
mean(sapply((1:nrow(data))[!is.na(clinicalData$CR_date) & clinicalData$AOD < 60], function(i) multiRFX3TplLoo[i, 1+data[i,"transplantCR1"] + 2*data[i, "transplantRel"] ]))
## [1] 0.5865632
The observed outcome and predictions are consistent. Now explore what would have been the best option for each patient:
apply(apply(-multiRFX3TplLoo[!is.na(clinicalData$CR_date) & clinicalData$AOD < 60,],1,rank),1,function(x) table(factor(x, levels=1:3)))
## none cr1 rel
## 1 0 861 248
## 2 92 156 861
## 3 1017 92 0
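The table counts, for each strategy, how often it is ranked first, second and third across patients. A minimal example of picking each patient's best predicted strategy (hypothetical survival probabilities, not study predictions):

```r
m <- rbind(c(0.50, 0.65, 0.55),  # patient 1: allograft in CR1 predicted best
           c(0.70, 0.68, 0.69))  # patient 2: no allograft predicted best
colnames(m) <- c("none", "cr1", "rel")
best <- colnames(m)[apply(m, 1, which.max)]
stopifnot(identical(best, c("cr1", "none")))
```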
The same as above, split by ELN risk categories:
table(clinicalData$M_Risk[!is.na(clinicalData$CR_date) & clinicalData$AOD < 60], factor(apply(multiRFX3TplLoo[!is.na(clinicalData$CR_date) & clinicalData$AOD < 60,], 1, which.max), levels=1:3, labels=colnames(multiRFX3TplLoo)))[c(2,4,3,1),]
##
## none cr1 rel
## Favorable 0 276 120
## Inter-2 0 137 49
## Inter-1 0 245 51
## Adverse 0 121 24
Split by ELN risk, requiring TPL in CR1 to offer a 5% advantage over salvage:
table(clinicalData$M_Risk, apply(multiRFX3TplLoo, 1, function(x) x[2] > x[3]+.05))[c(2,4,3,1),]
##
## FALSE TRUE
## Favorable 376 97
## Inter-2 167 101
## Inter-1 210 207
## Adverse 101 152
Compute the extrapolated survival under each of the different scenarios:
colMeans(multiRFX3TplLoo[!is.na(clinicalData$CR_date) & clinicalData$AOD < 60,])
## none cr1 rel
## 0.5546881 0.6305105 0.5920412
Best possible: if everyone had received the optimal strategy:
mean(apply(multiRFX3TplLoo[!is.na(clinicalData$CR_date) & clinicalData$AOD < 60,],1,max))
## [1] 0.6368276
patients <- c("PD11104a","PD8314a","PD10941a")
threePatientTplCiLoo <- sapply(patients, function(pd){
e <- new.env()
i <- which(rownames(dataFrame)==pd)
whichTrain <<- which(rownames(dataFrame)!=pd)
load(paste0("../code/loo/",i,".RData"), env=e)
multiRFX3TplCi <- MultiRFX3TplCi(e$rfxNrs, e$rfxRel, e$rfxPrs, data=data[i,colnames(e$rfxPrs$Z), drop=FALSE], x=3*365, nSim=1000, prdData=prdData[prdData$index!=i,], mc.cores=5)
}, simplify="array")
The following code is run on the cluster
nSim <- 200
read_chunk('../code/ciCor.R', labels="ciCor")
load("ciCor.RData")
library(mg14)
library(CoxHD)
library(Rcpp)
#save(allDataTpl, coxRFXNrdTD, coxRFXPrdTD, coxRFXRelTD, MultiRFX3, prdData, relData, nrdData, crGroups, nSim, file="../code/ciCor.RData")
cppFunction('NumericVector computeTotalPrsC(NumericVector x, NumericVector diffCir, NumericVector prsP, NumericVector tdPrmBaseline, double risk) {
int xLen = x.size();
double hj;
double r = exp(risk);
NumericVector rs(xLen);
for(int i = 0; i < xLen; ++i) rs[i] = 1;
for(int j = 1; j < xLen; ++j){
hj = tdPrmBaseline[j-1] * r;
for(int i = j; i < xLen; ++i){
rs[i] += diffCir[j-1] * (1-pow(prsP[i-j], hj));
}
}
return rs;
}', rebuild=TRUE)
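For readability, a pure-R transcription of the C++ convolution above (a reference sketch only; the pipeline uses the compiled version):

```r
## rs(t) = 1 + sum_j diffCir[j] * (1 - P_prs(t-j)^h(j)): post-relapse survival
## accumulated over all possible relapse times j, weighted by the (negative)
## increments of the cumulative incidence of relapse.
computeTotalPrsR <- function(x, diffCir, prsP, tdPrmBaseline, risk) {
    r <- exp(risk)
    rs <- rep(1, length(x))
    for (j in 2:length(x)) {            # loop over relapse intervals
        hj <- tdPrmBaseline[j - 1] * r  # relative post-relapse hazard at time j
        i <- j:length(x)
        rs[i] <- rs[i] + diffCir[j - 1] * (1 - prsP[i - j + 1]^hj)
    }
    rs
}
## e.g. computeTotalPrsR(1:3, c(-0.1, -0.05, 0), c(0.9, 0.8, 0.7), c(1, 1), 0)
```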
jobIndex <- as.numeric(Sys.getenv("LSB_JOBINDEX"))
load(paste0("loo/",jobIndex,".RData"))
cvIdx <- 1:nrow(dataFrame)
whichTrain <- which(cvIdx != jobIndex)
multiRFX3TplCiCorLoo <- sapply(1:nSim, function(foo){
set.seed(foo)
cNrd <- rfxNrs
cNrd$coefficients <- mvtnorm::rmvnorm(1, mean=cNrd$coefficients, sigma=coxRFXNrdTD$var2, method="chol")[1,]
cRel <- rfxRel
cRel$coefficients <- mvtnorm::rmvnorm(1, mean=cRel$coefficients, sigma=coxRFXRelTD$var2, method="chol")[1,]
cPrd <- rfxPrs
cPrd$coefficients <- mvtnorm::rmvnorm(1, mean=cPrd$coefficients, sigma=coxRFXPrdTD$var2, method="chol")[1,]
multiRFX3Tpl <- as.matrix(MultiRFX3(cNrd, cRel, cPrd, data=allDataTpl[3*jobIndex + (-2:0),], x=3*365, prdData=prdData[prdData$index %in% whichTrain,])[,c("os","cir","nrs","rs")])
rownames(multiRFX3Tpl) <- c("None","CR1","Relapse")
return(multiRFX3Tpl)
}, simplify='array')
save(multiRFX3TplCiCorLoo, file=paste0("ciCorLoo/",jobIndex,".RData"))
Collect the results
multiRFX3TplCiCorLoo <- simplify2array(mclapply(1:nrow(dataFrame), function(foo) try({
e <- new.env()
load(paste0("../code/ciCorLoo/",foo,".RData"), envir=e)
return(e$multiRFX3TplCiCorLoo)
}), mc.cores=4))
The figure shows the mortality reduction of allograft in CR1 v none, allograft in relapse v none, and CR1 v relapse, for LOO predictions similar to above.
par(mar=c(3,3,1,3), las=2, mgp=c(2,.5,0), bty="n")
benefit <- multiRFX3TplLoo[,2]-multiRFX3TplLoo[,3]
benefitGroup <- factor(benefit > 0.1, labels=c("Low","High"))
absrisk <- multiRFX3TplLoo[,1]
names(absrisk) <- names(benefit) <- rownames(dataFrame)
s <- clinicalData$AOD < 60 & !is.na(clinicalData$CR_date) &! clinicalData$TPL_Phase %in% c("RD1","PR1")
x <- 1-absrisk
y <- benefit
plot(x[s], y[s], pch=NA, ylab="Mortality reduction from allograft", xlab="3yr mortality with standard chemo", col=riskCol[clinicalData$M_Risk], cex=.8, las=1, ylim=range(benefit))
abline(h=seq(-.1,.3,.1), col='grey', lty=3)
abline(v=seq(.2,.9,0.2), col='grey', lty=3)
points(x[s], y[s], pch=16, col=riskCol[clinicalData$M_Risk[s]], cex=.8)
segments(1-threePatientTplCiLoo["none","lower","os",1,patients], y[patients],1-threePatientTplCiLoo["none","upper","os",1,patients],y[patients])
segments(x[patients], threePatientTplCiLoo["dCr1Rel","lower","os",1,patients],x[patients], threePatientTplCiLoo["dCr1Rel","upper","os",1,patients])
# Add loess fit, accounting for correlations of errors
xn <- seq(0.01,0.99,0.01)
fit <- sapply(1:nSim, function(i){
benefit <- multiRFX3TplCiCorLoo[2,"os",i,]-multiRFX3TplCiCorLoo[3,"os",i,]
absrisk <- multiRFX3TplCiCorLoo[1,"os",i,]
s <- clinicalData$AOD < 60 & !is.na(clinicalData$CR_date) &! clinicalData$TPL_Phase %in% c("RD1","PR1")
x <- 1-absrisk
y <- benefit
p <- predict(loess(y~x, data=data.frame(x=x[s], y=y[s])), newdata=data.frame(x=xn), se=TRUE)
yn <- c(p$fit + 2*p$se.fit,rev(p$fit - 2*p$se.fit))
p$fit
})
q <- apply(fit, 1, quantile, c(0.025, 0.5, 0.975), na.rm=TRUE)
polygon(c(xn, rev(xn)), c(q[1,], rev(q[3,])), border=NA, col="#00000044", lwd=1)
lines(xn, rowMeans(fit, na.rm=TRUE))
legend("topleft", pch=c(16,16,16,16,NA),lty=c(NA,NA,NA,NA,1), col=c(riskCol[c(2,4,3,1)],1),fill=c(NA,NA,NA,NA,"grey"), border=NA, c(names(riskCol)[c(2,4,3,1)],"loess average"), box.lty=0)
n <- c(100,50,20,10,5,4,3)
axis(side=4, at=1/n, labels=n, las=1)
mtext("Number needed to treat", side=4, at=.2, line=2, las=0)
axis(side=4, at=-1/n, labels=n, las=1)
mtext("Number needed to harm", side=4, at=-.1, line=2, las=0)
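The two secondary axes convert the absolute survival difference into a number needed to treat (or harm), NNT = 1/|absolute risk difference|. A one-line illustration with a made-up gain:

```r
## Patients one would need to allograft in CR1 to save one life by 3yr
nnt <- function(arr) 1 / abs(arr)
stopifnot(nnt(0.05) == 20)  # a 5% absolute gain corresponds to NNT = 20
```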
Violin plot of the predicted survival gain
par(mar=c(3,3,1,1), mgp=c(2,0.5,0), bty="n")
h <- density(benefit[s])
y <- h$y/diff(range(h$y))*.05 + par("usr")[3]
par(xpd=FALSE)
xx <- c(h$x, rev(h$x))
yy <- c(h$y, -rev(h$y))
v <- xx <= 0.1 #q[2]
plot(yy,xx, pch=NA, ylab="Predicted benefit", xlab="", xaxt="n", ylim=range(benefit))
polygon(yy[v], xx[v], border=NA, col=set1[1])
polygon(yy[!v], xx[!v], border=NA, col=set1[2])
lines(yy, xx)
KM plot of the high v low benefit groups
par(mar=c(3,3,1,1), mgp=c(2,0.5,0), bty="L")
f <- survfit(Surv(time1/365, time2/365, status) ~ group + transplantCR1, data=cbind(osData, group=benefitGroup[osData$index]), subset=osData$index %in% which(s) & !clinicalData$M_Risk[osData$index] %in% c("Favorable"))
summary(f, time=3)
## Call: survfit(formula = Surv(time1/365, time2/365, status) ~ group +
## transplantCR1, data = cbind(osData, group = benefitGroup[osData$index]),
## subset = osData$index %in% which(s) & !clinicalData$M_Risk[osData$index] %in%
## c("Favorable"))
##
## group=Low, transplantCR1=0
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 3.000 133.000 142.000 0.506 0.030 0.450 0.568
##
## group=Low, transplantCR1=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 3.0000 119.0000 86.0000 0.5499 0.0386 0.4792 0.6310
##
## group=High, transplantCR1=0
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 3.0000 14.0000 44.0000 0.2722 0.0594 0.1775 0.4174
##
## group=High, transplantCR1=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 3.0000 21.0000 32.0000 0.3955 0.0671 0.2835 0.5516
plot(f, col=rep(pastel1[1:nlevels(benefitGroup)],each=2), lty=rep(c(1,2), nlevels(benefitGroup)), xlab="Time after CR", ylab="Survival", xlim=c(0,5), cex=.5)
t <- table(which(s) %in% osData$index[osData$transplantCR1==1],benefitGroup[s],!is.na(clinicalData$CR_date[s]), !clinicalData$M_Risk[s] %in% c("Favorable"))[,,"TRUE","TRUE"]
legend("topright", legend=paste(rep(levels(benefitGroup), each=2), rep(c("no HSCT","w. HSCT"), 2), as.numeric(t), sep=", "), col=rep(pastel1[1:nlevels(benefitGroup)],each=2), lty=rep(c(1,2), nlevels(benefitGroup)), bty="n")
KM plots of the high v low benefit groups, split by ELN risk category
par(mar=c(3,3,2,1), mgp=c(2,0.5,0), bty="L", mfrow=c(3,2), cex=1)
e <- factor(paste(clinicalData$M_Risk))
for(l in levels(e)){
f <- survfit(Surv(time1/365, time2/365, status) ~ group + transplantCR1, data=cbind(osData, group=benefitGroup[osData$index]), subset=osData$index %in% which(s) & e[osData$index] == l)
summary(f, time=3)
plot(f, col=rep(pastel1[1:nlevels(benefitGroup)],each=2), lty=rep(c(1,2), nlevels(benefitGroup)), xlab="Time after CR", ylab="Survival", xlim=c(0,5), cex=.5, main=l)
t <- table(which(s) %in% osData$index[osData$transplantCR1==1],benefitGroup[s],!is.na(clinicalData$CR_date[s]), e[s])[,,"TRUE",l]
legend("topright", legend=paste(rep(levels(benefitGroup), each=2), rep(c("no HSCT","w. HSCT"), 2), as.numeric(t), sep=", "), col=rep(pastel1[1:nlevels(benefitGroup)],each=2), lty=rep(c(1,2), nlevels(benefitGroup)), bty="n")
}
The following plot shows the hypothetical population-level survival gains of the knowledge bank.
par(bty="L")
s <- clinicalData$AOD < 60 & !is.na(clinicalData$CR_date) & !clinicalData$TPL_Phase %in% c("PR1","RD1")
fAlloRelapse <- sum(prdData$transplantRel & s[!is.na(clinicalData$Recurrence_date)][prdData$index])/sum(relData$status & !relData$transplantCR1 & s[relData$index]) # fraction of patients that have received a salvage transplant
benefitAllo <- multiRFX3TplLoo[,"cr1"] - (fAlloRelapse*multiRFX3TplLoo[,"rel"] +(1-fAlloRelapse)*multiRFX3TplLoo[,"none"])
o <- order(-benefitAllo + ifelse(!s ,NA,0), na.last=NA)
pRelapse <- 1+multiRFX3TplCiLoo[1:2,1,"aar",] - multiRFX3TplCiLoo[1:2,1,"rs",] ## Relapse probabilities
fRelapse <- sapply(seq_along(o), function(i) mean(c(pRelapse[2,o[1:i]], pRelapse[1,o[-(1:i)]]), na.rm=TRUE)) # Personalised
sIdeal <- sapply(seq_along(o), function(i) mean(c(multiRFX3TplLoo[o[1:i],"cr1"], (1-fAlloRelapse)*multiRFX3TplLoo[o[-(1:i)],"none"] + fAlloRelapse*multiRFX3TplLoo[o[-(1:i)],"rel"]), na.rm=TRUE))
x <- seq_along(sIdeal)/length(sIdeal)
plot(x + (1-x)*fRelapse*fAlloRelapse,sIdeal, type='l', xlab="Total fraction of allografts", ylab="Survival of eligible patients 3yrs after CR", col=set1[1], xaxs="i", yaxs="i", lty=1)
p <- order(na.zero(c(1,4,2,3)[clinicalData$M_Risk]) + dataFrame$AOD_10/20 + ifelse(!s,NA,0) + ifelse(clinicalData$AOD>=60,NA,0), na.last=NA)
fRelapseEln <- sapply(seq_along(p), function(i) mean(c(pRelapse[2,p[1:i]], pRelapse[1,p[-(1:i)]]), na.rm=TRUE)) # ELN
sEln <- sapply(seq_along(p), function(i) mean(c(multiRFX3TplLoo[p[1:i],"cr1"], (1-fAlloRelapse)*multiRFX3TplLoo[p[-(1:i)],"none"] + fAlloRelapse*multiRFX3TplLoo[p[-(1:i)],"rel"]), na.rm=TRUE))
x <- seq_along(sEln)/length(sEln)
lines(x + (1-x)*fRelapseEln*fAlloRelapse, sEln, type='l', col=set1[2])
legend("bottomright", c("Prioritisation based on", "Knowledge bank", "ELN and age"), col=set1[c(NA,1,2)],lty=c(NA,1,1), bty="n", text.font=c(2,1,1))
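The construction of these curves can be sketched with made-up numbers: order patients by predicted benefit and, for each cutoff, average the predicted survival assuming the top patients are allografted in CR1 while the rest are not (salvage transplants omitted for simplicity):

```r
gain  <- c(0.15, 0.10, 0.02, -0.01)  # predicted benefit, sorted decreasingly (made up)
sNone <- c(0.40, 0.55, 0.70, 0.80)   # predicted 3yr survival without allograft
sCr1  <- sNone + gain                # predicted 3yr survival with allograft in CR1
curve <- sapply(0:4, function(i)
    mean(c(sCr1[seq_len(i)], sNone[setdiff(1:4, seq_len(i))])))
## the curve is concave: gains saturate once low-benefit patients are allografted
stopifnot(all(diff(diff(curve)) <= 1e-12))
```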
Total fraction of allografts
fAlloCR1 <- 0.3 ## Assume 30% allografts in CR1
i <- which(x > fAlloCR1)[1] - 1
c(`Knowlege bank`=(x + (1-x)*fRelapse*fAlloRelapse)[i], ELN=(x + (1-x)*fRelapseEln*fAlloRelapse)[i])
## Knowlege bank ELN
## 0.4358391 0.4380818
Projected survival at 3yrs
c(ELN=sEln[i], `Knowledge bank`=sIdeal[i])
## ELN Knowledge bank
## 0.6117393 0.6250449
The same survival as under ELN prioritisation is achieved with the following fraction of allografts:
j <- c(`Knowledge bank`=which(sIdeal >= sEln[i])[1]-1)
fAlloCR1Pers <- (x + (1-x)*fRelapse*fAlloRelapse)[j]
names(fAlloCR1Pers) <- names(j)
fAlloCR1Pers
## Knowledge bank
## 0.3392658
As there is some uncertainty related to the overall benefit of early v late allografts, the following plots show the benefit at the extremes of the expected distribution. Plots are shown for the 5%, 25% and 95% quantiles of the simulated benefit.
par(bty="L")
benefitCiLoo <- multiRFX3TplCiCorLoo[3,"os",,] - multiRFX3TplCiCorLoo[2,"os",,]
r <- rank(rowMeans(benefitCiLoo))
s <- clinicalData$AOD < 60 & !is.na(clinicalData$CR_date) & !clinicalData$TPL_Phase %in% c("PR1","RD1")
for(q in c(10, 50, 190)){
w <- which(r==q)
fAlloRelapse <- sum(prdData$transplantRel & s[!is.na(clinicalData$Recurrence_date)][prdData$index])/sum(relData$status & !relData$transplantCR1 & s[relData$index]) # fraction of patients that have received a salvage transplant
o <- order(-benefitAllo + ifelse(!s,NA,0), na.last=NA)
pRelapse <- 1-multiRFX3TplCiCorLoo[,"cir",w,] ## Relapse probabilities
fRelapse <- sapply(seq_along(o), function(i) mean(c(pRelapse[2,o[1:i]], pRelapse[1,o[-(1:i)]]), na.rm=TRUE)) # Personalised
sIdeal <- sapply(seq_along(o), function(i) mean(c(multiRFX3TplCiCorLoo["CR1","os",w,o[1:i]], (1-fAlloRelapse)*multiRFX3TplCiCorLoo["None","os",w,o[-(1:i)]] + fAlloRelapse*multiRFX3TplCiCorLoo["Relapse","os",w, o[-(1:i)]]), na.rm=TRUE))
x <- seq_along(sIdeal)/length(sIdeal)
plot(x + (1-x)*fRelapse*fAlloRelapse,sIdeal, type='l', xlab="Total fraction of allografts", ylab="Survival of eligible patients 3yrs after CR", col=set1[1], xaxs="i", yaxs="i", lty=1, ylim=c(0.59, 0.67))
p <- order(na.zero(c(1,4,2,3)[clinicalData$M_Risk]) + dataFrame$AOD_10/20 + ifelse(!s,NA,0) + ifelse(clinicalData$AOD>=60,NA,0), na.last=NA)
fRelapseEln <- sapply(seq_along(p), function(i) mean(c(pRelapse[2,p[1:i]], pRelapse[1,p[-(1:i)]]), na.rm=TRUE)) # ELN
sEln <- sapply(seq_along(p), function(i) mean(c(multiRFX3TplCiCorLoo["CR1","os",w,p[1:i]], (1-fAlloRelapse)*multiRFX3TplCiCorLoo["None","os",w,p[-(1:i)]] + fAlloRelapse*multiRFX3TplCiCorLoo["Relapse","os",w, p[-(1:i)]]), na.rm=TRUE))
x <- seq_along(sEln)/length(sEln)
lines(x + (1-x)*fRelapseEln*fAlloRelapse, sEln, type='l', col=set1[2])
legend("bottomright", c("Prioritisation based on", "Knowledge bank", "ELN and age"), col=set1[c(NA,1,2)],lty=c(NA,1,1), bty="n", text.font=c(2,1,1))
}
The bottom line is that we are able to confidently isolate a quarter of patients with high benefit of allografts (about 12% absolute benefit). The breakdown across ELN risk groups is:
table(benefitGroup[s], paste(clinicalData$M_Risk[s]), allograft=data$transplantCR1[s])
## , , allograft = 0
##
##
## Adverse Favorable Inter-1 Inter-2 NA
## Low 37 321 120 87 40
## High 12 9 29 4 15
##
## , , allograft = 1
##
##
## Adverse Favorable Inter-1 Inter-2 NA
## Low 51 54 87 58 16
## High 17 2 21 11 4
summary(coxph(Surv(time1/365, time2/365, status) ~ transplantCR1 + AOD_10 + clinicalData$M_Risk[osData$index] , data=osData, subset=osData$index %in% which(s) & benefitGroup[osData$index]=="Low"))
## Call:
## coxph(formula = Surv(time1/365, time2/365, status) ~ transplantCR1 +
## AOD_10 + clinicalData$M_Risk[osData$index], data = osData,
## subset = osData$index %in% which(s) & benefitGroup[osData$index] ==
## "Low")
##
## n= 1225, number of events= 342
## (90 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## transplantCR1 -0.30067 0.74032 0.12538 -2.398 0.016483 *
## AOD_10 0.24248 1.27440 0.05606 4.326 1.52e-05 ***
## clinicalData$M_Risk[osData$index]Favorable -1.41836 0.24211 0.17131 -8.279 < 2e-16 ***
## clinicalData$M_Risk[osData$index]Inter-1 -0.47766 0.62023 0.16191 -2.950 0.003176 **
## clinicalData$M_Risk[osData$index]Inter-2 -0.65588 0.51899 0.17808 -3.683 0.000231 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## transplantCR1 0.7403 1.3508 0.5790 0.9466
## AOD_10 1.2744 0.7847 1.1418 1.4224
## clinicalData$M_Risk[osData$index]Favorable 0.2421 4.1303 0.1731 0.3387
## clinicalData$M_Risk[osData$index]Inter-1 0.6202 1.6123 0.4516 0.8519
## clinicalData$M_Risk[osData$index]Inter-2 0.5190 1.9268 0.3661 0.7358
##
## Concordance= 0.666 (se = 0.016 )
## Rsquare= 0.087 (max possible= 0.971 )
## Likelihood ratio test= 111.5 on 5 df, p=0
## Wald test = 110.2 on 5 df, p=0
## Score (logrank) test = 118.8 on 5 df, p=0
summary(coxph(Surv(time1/365, time2/365, status) ~ transplantCR1 + AOD_10 + clinicalData$M_Risk[osData$index] , data=osData, subset=osData$index %in% which(s) & benefitGroup[osData$index]=="High"))
## Call:
## coxph(formula = Surv(time1/365, time2/365, status) ~ transplantCR1 +
## AOD_10 + clinicalData$M_Risk[osData$index], data = osData,
## subset = osData$index %in% which(s) & benefitGroup[osData$index] ==
## "High")
##
## n= 174, number of events= 73
## (25 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## transplantCR1 -0.7061 0.4936 0.2569 -2.749 0.00599 **
## AOD_10 0.0705 1.0731 0.1187 0.594 0.55235
## clinicalData$M_Risk[osData$index]Favorable -0.8071 0.4461 0.4328 -1.865 0.06222 .
## clinicalData$M_Risk[osData$index]Inter-1 -0.2869 0.7506 0.2733 -1.050 0.29377
## clinicalData$M_Risk[osData$index]Inter-2 -1.3228 0.2664 0.4965 -2.664 0.00771 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## transplantCR1 0.4936 2.0260 0.2983 0.8166
## AOD_10 1.0730 0.9319 0.8504 1.3540
## clinicalData$M_Risk[osData$index]Favorable 0.4461 2.2414 0.1910 1.0421
## clinicalData$M_Risk[osData$index]Inter-1 0.7505 1.3324 0.4393 1.2824
## clinicalData$M_Risk[osData$index]Inter-2 0.2664 3.7540 0.1007 0.7049
##
## Concordance= 0.674 (se = 0.036 )
## Rsquare= 0.099 (max possible= 0.969 )
## Likelihood ratio test= 18.22 on 5 df, p=0.002679
## Wald test = 15.71 on 5 df, p=0.00773
## Score (logrank) test = 16.76 on 5 df, p=0.004988
Distribution of genetic variables
par(mar=c(6,3,1,1))
o <- order(-colSums(dataFrame[s,groups %in% c("Genetics","CNA","Fusions")]))
t <- t(sapply(split(dataFrame[s,groups %in% c("Genetics","CNA","Fusions")], benefitGroup[s]), colSums))
barplot(t[,o], col=pastel1[1:2], legend=TRUE, las=2, xaxs="i", cex.lab=0.66, args.legend=list(title="Benefit", bty="n", border=NA), border=NA)
cvFold <- nrow(dataFrame)
cvIdx <- 1:nrow(dataFrame)
p <- Reduce("rbind", mclapply(cvIdx, function(i){
whichTrain <- which(cvIdx != i)
rfxOS <- CoxRFX(osData[osData$index %in% whichTrain, names(crGroups)], Surv(osData$time1, osData$time2, osData$status)[osData$index %in% whichTrain], groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
p <- as.data.frame(predict(rfxOS, newdata=osData[!osData$index %in% whichTrain, names(crGroups)], se.fit=TRUE))
s <- summary(survfit(Surv(osData$time1, osData$time2, osData$status)[osData$index %in% whichTrain] ~ 1), time=3*365)$surv
cbind(p, surv=s^exp(p$fit))
}, mc.cores=10))
d <- duplicated(sub(".1$","",rownames(p)))
coxRFXOsCrLOO <- rbind(p[!d,], p[d,])
rm(p,d)
Compare with corresponding multistage predictions
m <- c(multiRFX3TplLoo[,3],multiRFX3TplLoo[osData$index[osData$transplantCR1==1],2])
r <- c(coxRFXOsCrLOO$surv[1:1540],coxRFXOsCrLOO$surv[osData$transplantCR1==1])
plot(m, r)
abline(0,1)
cor(m, r)
## [1] 0.9314411
3-state model
c <- Surv(as.numeric(clinicalData$Date_LF - clinicalData$CR_date), clinicalData$Status)
p <- multiRFX3TplCi[3,1,1,]
p[osData$index[osData$transplantCR1==1]] <- multiRFX3TplCi[2,1,1,osData$index[osData$transplantCR1==1]]
ape(p, c, time=3*365)
## abs brier log2 bayes
## 0.3834699 0.1867695 0.7745674 0.2836879
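For orientation, the brier column reported by ape is, up to censoring weights, the mean squared difference between predicted survival probability and observed status at the evaluation time. A minimal sketch that ignores censoring (hypothetical predictions and outcomes, 1 = alive):

```r
brierNoCens <- function(pred, alive) mean((pred - alive)^2)
stopifnot(abs(brierNoCens(c(0.9, 0.2), c(1, 0)) - 0.025) < 1e-9)
```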
RFX
unduplicate <- function(index) {
    u <- unique(index)
    u[which(rev(duplicated(rev(index))))] <- seq_along(index)[duplicated(index)]
    return(u)
}
q <- summary(survfit(Surv(time1,time2,status) ~ 1, data=osData), time=3*365)$surv^exp(scale(predict(coxRFXOsCR, newdata=osData[unduplicate(osData$index),]), scale=FALSE))
ape(q, c, time=3*365)
## abs brier log2 bayes
## 0.3919441 0.1918189 0.7925326 0.2931442
3-state model
p <- multiRFX3TplLoo[,3]
p[osData$index[osData$transplantCR1==1]] <- multiRFX3TplLoo[osData$index[osData$transplantCR1==1],2]
ape(p, c, time=3*365)
## abs brier log2 bayes
## 0.3969946 0.1995807 0.8183460 0.3159968
RFX
ape(coxRFXOsCrLOO$surv[unduplicate(osData$index)], c, time=3*365)
## abs brier log2 bayes
## 0.4086589 0.2070700 0.8496459 0.3167849
Not all variables are needed to make predictions. Here we systematically assess the effect of the set of known genes on our predictive accuracy.
w <- WaldTest(coxRFXFitOsTDGGc)
o <- order(w$p.value[groups[whichRFXOsTDGG] %in% c("Genetics","GeneGene")])
genes <- unique(sub("_.+","",unlist(strsplit(names(whichRFXOsTDGG[groups[whichRFXOsTDGG]%in% c("Genetics","GeneGene")])[o],":"))))
cvFold <- 1540
foo <- 42
set.seed(foo)
cvIdx <- 1:cvFold #sample(1:nrow(dataFrame)%% cvFold +1 ) ## sample 1/10
m <- unlist(sapply(1:cvFold, function(i) which(tplSplitOs %in% which(cvIdx==i))))
o <- order(m)
imputedRiskCv <- do.call("abind", c(mclapply(1:cvFold, function(i){
whichTrain <- which(cvIdx != i)
ix <- tplSplitOs %in% whichTrain
cRfx <- CoxRFX(dataFrameOsTD[ix,whichRFXOsTDGG], osTD[ix], groups[whichRFXOsTDGG], which.mu=mainGroups) ## allow only the main groups to have mean different from zero..
imputedRisk <- sapply(mclapply(c(0,seq_along(genes)), function(i){
na.genes <- if(i==0) genes else genes[-(1:i)]
if(length(na.genes)==0) na.genes <- "FOO42"
d <- dataFrameOsTD[,whichRFXOsTDGG]
d[grepl(paste(na.genes, collapse="|"), colnames(d))] <- NA
p <- PredictRiskMissing(cRfx, d[!ix,,drop=FALSE])
}, mc.cores=1), I, simplify="array")
dimnames(imputedRisk)[[3]] <- c("None",genes)
return(imputedRisk)
}, mc.cores=10), along=1))[o,,]
par(mar=c(3,3,3,1))
imputedCCv <- sapply(dimnames(imputedRiskCv)[[3]], function(i) as.numeric(survConcordance(osTD ~ imputedRiskCv[,1,i])[c("concordance","std.err")]))
x <- 0:ncol(imputedCCv)-.5
plot(x, c(imputedCCv[1,], imputedCCv[1,ncol(imputedCCv)]), type="s", xaxt="n", xlab="", ylab="Concordance", ylim=range(imputedCCv[1,]) + c(-1,1)*imputedCCv[2,1])
polygon(c(rep(x,each=2)[-c(1, 2*length(x))],rep(rev(x), each=2)[-c(1, 2*length(x))]), c(rep(imputedCCv[1,]+imputedCCv[2,], each=2), rep(rev(imputedCCv[1,]-imputedCCv[2,]), each=2)), border=NA, col="#00000044")
mtext(dimnames(imputedRiskCv)[[3]], side=1, at=1:ncol(imputedCCv)-1, las=2, font=3, cex=.9)
abline(v=seq(0,50,10), lty=3)
abline(h=seq(0.68,0.73,0.01), lty=3)
axis(side=3)
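Conceptually, PredictRiskMissing replaces the contribution of unobserved covariates to the log hazard by its expectation under the training distribution; the simplest mean-imputation variant of this idea, with made-up numbers:

```r
beta <- c(0.5, -0.3, 0.2)          # fitted log hazard ratios (hypothetical)
xbar <- c(0.25, 0.10, 0.40)        # covariate means in the training data
x    <- c(1, NA, 0)                # new patient; second gene not sequenced
xImp <- ifelse(is.na(x), xbar, x)  # fill missing covariates with their mean
risk <- sum(beta * xImp)           # expected log relative hazard
stopifnot(abs(risk - (0.5 * 1 - 0.3 * 0.10 + 0.2 * 0)) < 1e-12)
```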
read_chunk('../code/imputation.R', labels="imputationMultiRfx")
load("loo.RData")
load("genesImputation.RData")
library(mg14)
library(CoxHD)
library(Rcpp)
i <- as.numeric(Sys.getenv("LSB_JOBINDEX"))
cvIdx <- 1:nrow(dataFrame)
whichTrain <- which(cvIdx != i)
e <- new.env()
t <- try(load(paste0("loo/",i,".RData"), env=e))
if(class(t)=="try-error"){
stop()
}else{
whichTrain <<- (1:nrow(data))[-i]
dMiss <- do.call("rbind", lapply(c(0,seq_along(genes)), function(g){
na.genes <- if(g==0) genes else genes[-(1:g)]
if(length(na.genes)==0) na.genes <- "FOO42"
d <- data[i,, drop=FALSE]
d[grepl(paste(na.genes, collapse="|"), colnames(d))] <- NA
d
}))
xx <- 0:2000
coxphPrs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(prdData, time0=as.numeric(clinicalData$Recurrence_date-clinicalData$CR_date)[prdData$index])[prdData$index %in% whichTrain,])
tdPrmBaseline <- exp(predict(coxphPrs, newdata=data.frame(time0=xx[-1])))
coxphOs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(osData, time0=pmin(500,cr[osData$index,1]))[osData$index %in% whichTrain,])
tdOsBaseline <- exp(predict(coxphOs, newdata=data.frame(time0=xx[-1])))
multiRfx5Imputed <- MultiRFX5(e$rfxEs, e$rfxCr, e$rfxNrs, e$rfxRel, e$rfxPrs, dMiss, tdPrmBaseline = tdPrmBaseline, tdOsBaseline = tdOsBaseline, x=2000)
save(multiRfx5Imputed, file=paste0("imputed/",i,".RData"))
}
Collect data
multiRfx5CvImputed <- sapply(mclapply(1:nrow(data), function(i){
e <- new.env()
t <- try(load(paste0("../code/imputed/",i,".RData"), env=e))
if(inherits(t, "try-error")) return(rep(NA, length(genes)+1))
else colSums(e$multiRfx5Imputed[3*365,1:3,])
}, mc.cores=10), I)
Accuracy of predictions with imputed data
par(mar=c(3,3,3,1))
multiRfx5CvImputedC <- sapply(1:nrow(multiRfx5CvImputed), function(i) as.numeric(survConcordance(os ~ multiRfx5CvImputed[i,])[c('concordance','std.err')]))
x <- 0:ncol(multiRfx5CvImputedC)-.5
plot(x, c(multiRfx5CvImputedC[1,], multiRfx5CvImputedC[1,ncol(multiRfx5CvImputedC)]), type="s", xaxt="n", xlab="", ylab="Concordance", ylim=range(multiRfx5CvImputedC[1,]) + c(-1,1)*multiRfx5CvImputedC[2,1])
polygon(c(rep(x,each=2)[-c(1, 2*length(x))],rep(rev(x), each=2)[-c(1, 2*length(x))]), c(rep(multiRfx5CvImputedC[1,]+multiRfx5CvImputedC[2,], each=2), rep(rev(multiRfx5CvImputedC[1,]-multiRfx5CvImputedC[2,]), each=2)), border=NA, col="#00000044")
mtext(dimnames(imputedRiskCv)[[3]], side=1, at=1:ncol(multiRfx5CvImputedC)-1, las=2, font=3, cex=.9)
abline(v=seq(0,50,10), lty=3)
abline(h=seq(0.68,0.73,0.01), lty=3)
axis(side=3)
par(mar=c(3,3,3,1))
multiRfx5CvImputedApe <- sapply(1:nrow(multiRfx5CvImputed), function(i) ape(1-multiRfx5CvImputed[i,], os, 3*365))
x <- 0:ncol(multiRfx5CvImputedApe)-.5
for(i in 1:4){
plot(x, c(multiRfx5CvImputedApe[i,], multiRfx5CvImputedApe[i,ncol(multiRfx5CvImputedApe)]), type="s", xaxt="n", xlab="", ylab=rownames(multiRfx5CvImputedApe)[i], col=set1[i])
mtext(dimnames(imputedRiskCv)[[3]], side=1, at=1:ncol(multiRfx5CvImputedApe)-1, las=2, font=3, cex=.9)
abline(v=seq(0,50,10), lty=3)
abline(h=axTicks(side=2), lty=3)
axis(side=3)
}
We used cross-validation to evaluate the performance of different modelling strategies. The idea is to split the data into a training and a test set; the model is fitted on the training set and its prognostic accuracy is evaluated on the test set.
We randomly split the data 100 times into 80% training data and 20% test data. For each split, we evaluated the following metrics:

- Harrell's C (survival::survConcordance())
- Uno's AUC (survAUC::AUC.uno())
- the prediction error, i.e. Brier score (survAUC::predErr())
- the R2 measure of O'Quigley, Xu, and Stare (survAUC::OXS())

The latter three are implemented in the survAUC R package (Potapov, Adler, and Schmid 2012).
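The per-split procedure can be sketched as follows with simulated data; the covariates, effect sizes, and variable names below are purely illustrative, not the actual AML data:

```r
## One cross-validation replicate: fit a Cox model on 80% of the data
## and compute Harrell's C on the held-out 20% (simulated data).
library(survival)
set.seed(1)
n <- 500
X <- matrix(rbinom(n * 5, 1, 0.3), ncol = 5)  # five binary covariates
beta <- c(0.8, -0.5, 0.4, 0, 0)               # true log hazard ratios
time <- rexp(n, rate = exp(X %*% beta) / 10)
status <- rbinom(n, 1, 0.8)                   # roughly 20% censoring
trainIdx <- sample(1:n %% 5 + 1) != 1         # 80% train / 20% test
fit <- coxph(Surv(time, status)[trainIdx] ~ X[trainIdx, ])
lp <- X[!trainIdx, ] %*% coef(fit)            # linear predictor on test set
cc <- survConcordance(Surv(time, status)[!trainIdx] ~ lp)$concordance
cc
```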
The data comprised patients from three different trials - AMLSG07/04 (n=740), AMLHD98A (n=627), and AMLHD98B (n=173). In addition to randomly splitting the data into training and test partitions, we trained the model on all three combinations of two trials and evaluated the prognostic accuracy on the third trial. This situation is more challenging as there may be systematic differences between the trials, but it can also be expected to more closely mimic the situation of predicting outcomes for a novel cohort.
As an additional and independent evaluation cohort, we downloaded data from The Cancer Genome Atlas (TCGA) (Cancer Genome Atlas Research Network 2013). We obtained variant calls from exome sequencing and cytogenetic data for n=200 patients and annotated oncogenic mutations as described in our companion paper. Overall survival was available for n=186 patients. For missing prognostic variables, we used a covariance-based imputation, with the covariance matrix derived from our original data set (n=1,540). We note that no data on allografts were available.
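The covariance-based imputation amounts to replacing each missing covariate by its conditional expectation given the observed ones under a multivariate normal model; the actual computation is performed by CoxHD's PredictRiskMissing(). A minimal sketch of that single step (all numbers are made up):

```r
## Impute a missing covariate by its conditional expectation given the
## observed ones: E[x_m | x_o] = mu_m + S_mo %*% solve(S_oo) %*% (x_o - mu_o).
Sigma <- matrix(c(1,   0.5, 0.3,
                  0.5, 1,   0.2,
                  0.3, 0.2, 1), nrow = 3)  # made-up covariance matrix
mu <- c(0.2, 0.4, 0.1)                     # made-up covariate means
x <- c(0.9, NA, 0.5)                       # second covariate is missing
o <- !is.na(x); m <- is.na(x)              # observed / missing masks
x[m] <- mu[m] + Sigma[m, o, drop = FALSE] %*% solve(Sigma[o, o], x[o] - mu[o])
x
```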
library(rpart)
library(randomForestSRC)
##
## randomForestSRC 1.6.0
##
## Type rfsrc.news() to see new features, changes, and bug fixes.
##
replicates <- 100 ## number of replicates
scope <- c("Genetics","CNA","Treatment","Fusions") ## For CPSS
scopeStep <- as.formula(paste("os ~", paste(colnames(dataFrame)[mainIdxOs& osIdx], collapse="+"))) ## For AIC&BIC
allModelsCV <- mclapply(1:replicates, function(foo){
set.seed(foo)
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## sample 1/5
c <- coxph(os[trainIdx] ~ 1, data=dataFrame[trainIdx,mainIdxOs])
scopeStep <- as.formula(paste("os[trainIdx] ~", paste(colnames(dataFrame)[mainIdxOs], collapse="+")))
coxBICOsTrain <- step(c, scope=scopeStep, k = log(sum(trainIdx)), trace=0)
coxAICOsTrain <- step(coxBICOsTrain, scope=scopeStep, k = 2, trace=0)
coxCPSSOsTrain <- CoxCPSSInteractions(dataFrame[!is.na(os) & trainIdx, mainIdxOs], na.omit(os[trainIdx]), bootstrap.samples=50, scope = which(groups %in% scope))
coxRFXOsTrain <- CoxRFX(dataFrame[trainIdx,mainIdxOs], os[trainIdx], groups=groups[mainIdxOs])
coxRFXOsTrain$Z <- NULL
coxRFXOsGGc <- CoxRFX(dataFrame[trainIdx,whichRFXOsGG], os[trainIdx], groups=groups[whichRFXOsGG], which.mu=mainGroups)
coxRFXOsGGc$Z <- NULL
rForestOsTrain <- rfsrc(Surv(time, status) ~.,data= cbind(time = os[,1], status = os[,2], dataFrame[,mainIdxOs])[trainIdx,], ntree=100, importance="none")
return(list(
BIC=coxBICOsTrain,
AIC=coxAICOsTrain,
CPSS=coxCPSSOsTrain,
RFX=coxRFXOsTrain,
RFXgg=coxRFXOsGGc,
rForest=rForestOsTrain
))
}, mc.cores=10)
Compute predictions for all model fits
predictAllModels <- function(x, newdata){
if("rfsrc" %in% class(x)){
predict(x, newdata, importance="none")$predicted
}else{
predict(x, newdata)
}
}
allModelsCvPredictions <- mclapply(seq_along(allModelsCV), function(foo){
set.seed(foo)
x <- allModelsCV[[foo]]
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## sample 1/5
cbind(ELN=c(4,1,3,2)[clinicalData$M_Risk[!trainIdx]],
sapply(x, function(y){
predictAllModels(y, newdata=dataFrame[!trainIdx,])
}))
}, mc.cores=10)
colModels <- c("#888888", set1[c(2,1,4,3,5,7)])
Harrell’s C
foo <- 1
allModelsCvC <- sapply(allModelsCvPredictions, function(x){
set.seed(foo)
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## sample 1/5
foo <<- foo +1
apply(x, 2 , function(p){
survConcordance(osYr[!trainIdx,] ~ p)$concordance
})
})
apply(allModelsCvC,1,quantile)
## ELN BIC AIC CPSS RFX RFXgg rForest
## 0% 0.5880492 0.6423902 0.6358633 0.6454865 0.6534156 0.6579039 0.6539126
## 25% 0.6281709 0.6816841 0.6760669 0.6807904 0.6894067 0.6943526 0.6800193
## 50% 0.6437515 0.6918903 0.6923465 0.6955938 0.7050271 0.7095102 0.6918675
## 75% 0.6546729 0.7048202 0.7055158 0.7048667 0.7165585 0.7202644 0.7039033
## 100% 0.6863701 0.7407616 0.7372427 0.7390396 0.7483232 0.7490528 0.7363960
par(mar=c(3,3,1,1),bty="n", mgp=c(2,.5,0), las=2)
r <- sapply(as.data.frame(lapply(as.data.frame(t(apply(-allModelsCvC,2,rank))),factor, levels=1:7)),table)
o <- order(apply(allModelsCvC,1,median))
boxplot(t(allModelsCvC[o,]), notch=TRUE, ylab="Concordance", staplewex=0, lty=1, pch=16, xaxt="n")
rotatedLabel(1:7, rep(par("usr")[3],7), rownames(allModelsCvC)[o])
par(mar=c(3,3,3,1), xpd=NA, las=2, mgp=c(2,.5,0))
barplot(r[,o]/replicates, col=c(set1[c(3,2,4,1,5,7)],"grey"), ylab="Fraction", names.arg=rep("",ncol(r))) -> b
rotatedLabel(b, rep(par("usr")[3],7), colnames(allModelsCvC)[o])
legend(par("usr")[1],1.5, fill=c(set1[c(3,2,4,1,5,7)],"grey"), legend=1:7, bty="n", border=NA, horiz=TRUE, title="Rank")
Brier scores
library(survAUC)
foo <- 1
allModelsCvBrier<- sapply(allModelsCV, function(x){
set.seed(foo)
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## sample 1/5
foo <<- foo +1
sapply(x, function(y){
p <- predictAllModels(y, newdata=dataFrame)
a <- predErr(Surv.rsp = osYr[trainIdx,], Surv.rsp.new = osYr[!trainIdx,], lp=p[trainIdx], lpnew = p[!trainIdx], times= c(90,365,1000)/365, type="brier")$error
})
})
apply(allModelsCvBrier,1,quantile)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## 0% 0.05849797 0.1526225 0.1757253 0.05944734 0.1503072 0.1760910 0.05846343 0.1511088 0.1786459 0.05605151 0.1495037
## 25% 0.07750536 0.1753677 0.1937367 0.07789139 0.1761197 0.1978792 0.07718722 0.1741391 0.1952522 0.07664560 0.1721626
## 50% 0.08304442 0.1835706 0.2036727 0.08438041 0.1833998 0.2071846 0.08321443 0.1819881 0.2040855 0.08271161 0.1798175
## 75% 0.09084118 0.1898236 0.2102678 0.09188209 0.1912654 0.2164903 0.09047345 0.1882972 0.2085808 0.09088417 0.1865091
## 100% 0.10871873 0.2126778 0.2380005 0.10799058 0.2130452 0.2498866 0.10677651 0.2135231 0.2329530 0.10806363 0.2077360
## [,12] [,13] [,14] [,15] [,16] [,17] [,18]
## 0% 0.1741597 0.05555387 0.1464211 0.1717377 0.06177217 0.2322100 0.3047124
## 25% 0.1886995 0.07639397 0.1704750 0.1870156 0.09108158 0.2752806 0.3576704
## 50% 0.1977054 0.08184680 0.1792870 0.1958929 0.10079971 0.2934602 0.3789192
## 75% 0.2050220 0.09022773 0.1849654 0.2046791 0.11054862 0.3124864 0.3938336
## 100% 0.2407017 0.10635080 0.2035862 0.2412429 0.13336421 0.3411525 0.4467354
rownames(allModelsCvBrier) <- paste(rep(names(allModelsCV[[1]]), each=3), rep(c(90,365,1000), length(allModelsCV[[1]])))
boxplot(t(allModelsCvBrier)[,rep(0:5*3, 3) + rep(1:3, each=6)],notch=TRUE, ylab="Brier score", border=rep(colModels[-1],3), las=2, lty=1, pch=16, staplewex=0)
Gönen and Heller’s concordance index (GHCI)
allModelsCvGHCI<- sapply(allModelsCvPredictions, function(x){
apply(x[,2:6], 2 , function(p){
p <- GHCI(lpnew = na.omit(p))
})
})
apply(allModelsCvGHCI,1,quantile)
## BIC AIC CPSS RFX RFXgg
## 0% 0.6806809 0.6982268 0.6756061 0.6867523 0.6951368
## 25% 0.6944020 0.7155827 0.6928061 0.6965384 0.7075201
## 50% 0.7008508 0.7208792 0.6976200 0.7029009 0.7117394
## 75% 0.7047188 0.7271990 0.7032465 0.7079622 0.7169094
## 100% 0.7210609 0.7407248 0.7163284 0.7179690 0.7258588
boxplot(t(allModelsCvGHCI),notch=TRUE, ylab="GH C", border=colModels[2:6], las=2, lty=1, pch=16, staplewex=0)
OXS R2 estimates
foo <- 1
allModelsCvOXS <- sapply(allModelsCvPredictions, function(x){
set.seed(foo)
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## sample 1/5
foo <<- foo +1
apply(x[,2:6], 2 , function(p){
a <- OXS(osYr[!trainIdx,], p, rep(0,length(p)))
})
})
apply(allModelsCvOXS,1,quantile)
## BIC AIC CPSS RFX RFXgg
## 0% 0.1289646 -4.161681e+238 0.08704625 0.1436286 0.1531176
## 25% 0.2990897 1.889040e-01 0.30089659 0.3180463 0.3265460
## 50% 0.3431369 2.620666e-01 0.34378953 0.3746620 0.3838629
## 75% 0.3862799 3.374566e-01 0.38786964 0.4142649 0.4283471
## 100% 0.4765289 4.554361e-01 0.47522406 0.5001872 0.5180129
boxplot(t(allModelsCvOXS), ylim=c(0,0.5), notch=TRUE, ylab="OXS R2", border=colModels[2:6], las=2, lty=1, pch=16, staplewex=0)
Nagelkerke R2 estimates
foo <- 1
allModelsCvNagelk <- sapply(allModelsCvPredictions, function(x){
set.seed(foo)
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## sample 1/5
foo <<- foo +1
apply(x[,2:6], 2 , function(p){
a <- Nagelk(osYr[!trainIdx,], p, rep(0,length(p)))
})
})
apply(allModelsCvNagelk,1,quantile)
## BIC AIC CPSS RFX RFXgg
## 0% 0.08257594 -1.993769e+150 0.05609423 0.08634564 0.09225612
## 25% 0.18763141 1.149669e-01 0.19279267 0.20628936 0.21301746
## 50% 0.21625945 1.628463e-01 0.21672847 0.24227620 0.24468426
## 75% 0.24690641 2.180015e-01 0.24761718 0.27236095 0.28003519
## 100% 0.31557340 2.995430e-01 0.29898599 0.32345779 0.33542100
boxplot(t(allModelsCvNagelk), ylim=c(0,0.4), notch=TRUE, ylab="Nagelkerke's R2", border=colModels[2:6], las=2, lty=1, pch=16, staplewex=0)
UnoC
foo <- 1
allModelsCvUnoC<- sapply(allModelsCvPredictions, function(x){
set.seed(foo)
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## sample 1/5
foo <<- foo +1
apply(x,2, function(p){
a <- UnoC(Surv.rsp = osYr[trainIdx,], Surv.rsp.new = osYr[!trainIdx,][!is.na(p)], lpnew = na.omit(p), time=5)
})
})
apply(allModelsCvUnoC,1,quantile)
## ELN BIC AIC CPSS RFX RFXgg rForest
## 0% 0.4512105 0.6430067 0.6334845 0.6451623 0.6509440 0.6557092 0.6498282
## 25% 0.5026052 0.6825757 0.6772569 0.6826433 0.6911294 0.6934710 0.6797299
## 50% 0.5176337 0.6926510 0.6930621 0.6956262 0.7032943 0.7085578 0.6910261
## 75% 0.5304213 0.7048022 0.7047715 0.7049114 0.7170992 0.7202833 0.7029446
## 100% 0.5736298 0.7379782 0.7371860 0.7353561 0.7506707 0.7493965 0.7393023
boxplot(t(allModelsCvUnoC), notch=TRUE, ylab="Uno's C", border=colModels, lty=1, pch=16, staplewex=0)
Uno’s AUC
t <- seq(0.1,5,0.1) #times
allModelsCvAuc <- sapply(seq_along(allModelsCvPredictions), function(foo){
set.seed(foo)
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## sample 1/5
apply(allModelsCvPredictions[[foo]], 2, function(p){
AUC.uno(osYr[trainIdx,], osYr[!trainIdx, ][!is.na(p)], scale(na.omit(p)), t)$auc
})
})
allModelsCvAuc <- array(allModelsCvAuc, dim=c(length(t),ncol(allModelsCvPredictions[[1]]),length(allModelsCvPredictions)))
plot(NA,NA, xlab="Years",ylab="AUC", xlim=range(t), ylim=c(0.5,0.8))
for(i in 1:dim(allModelsCvAuc)[2]){
lines(t, rowMeans(allModelsCvAuc, dims=2)[,i], col=colModels[i])
}
legend("bottomright", colnames(allModelsCvPredictions[[1]]), bty="n", lty=1, col=colModels)
Wisdom of the crowds?
foo <- 1
allModelsCvCCrowd <- sapply(allModelsCvPredictions, function(x){
set.seed(foo)
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## sample 1/5
foo <<- foo +1
r <- rowMeans(apply(x, 2 , rank))
survConcordance(osYr[!trainIdx,] ~ r)$concordance
})
quantile(allModelsCvCCrowd)
## 0% 25% 50% 75% 100%
## 0.6648901 0.6949264 0.7088463 0.7206734 0.7520978
boxplot(cbind(t(allModelsCvC),allModelsCvCCrowd), notch=TRUE, ylab="Concordance", border=c(colModels,1), las=2, lty=1, pch=16, staplewex=0)
ranks <- apply(apply(-cbind(t(allModelsCvC),crowd=allModelsCvCCrowd),1,rank, ties.method="random"),1,function(x) table(factor(x, levels=1:8)))
ranks <- ranks[,order(1:8 %*% ranks)]
Clean up
rm(allModelsCV)
Here we assess RFX models with interaction terms for different variable categories.
replicates <- 100 ## number of replicates
allModelsCvRfxC <- do.call("rbind",mclapply(1:100, function(foo){
set.seed(foo)
trainIdx <- sample(1:nrow(dataFrameOsTD)%%5 +1 )[tplSplitOs]!=1 ## sample 1/5
coxRFXOsMain <- CoxRFX(dataFrameOsTD[trainIdx,mainIdxOsTD], osTD[trainIdx], groups=groups[mainIdxOsTD])
coxRFXOsGG <- CoxRFX(dataFrameOsTD[trainIdx,whichRFXOsTDGG], osTD[trainIdx], groups=groups[whichRFXOsTDGG])
coxRFXOsGGc <- CoxRFX(dataFrameOsTD[trainIdx,whichRFXOsTDGG], osTD[trainIdx], groups=groups[whichRFXOsTDGG], which.mu=mainGroups)
coxRFXOsAll <- CoxRFX(dataFrameOsTD[trainIdx,whichRFXOsTD], osTD[trainIdx], groups=groups[whichRFXOsTD])
coxRFXOsAllc <- CoxRFX(dataFrameOsTD[trainIdx,whichRFXOsTD], osTD[trainIdx], groups=groups[whichRFXOsTD], which.mu=mainGroups)
return(c(
Main=survConcordance(osTD[!trainIdx]~as.matrix(dataFrameOsTD[!trainIdx,mainIdxOsTD]) %*% coef(coxRFXOsMain))$concordance,
GeneGene=survConcordance(osTD[!trainIdx]~as.matrix(dataFrameOsTD[!trainIdx,whichRFXOsTDGG]) %*% coef(coxRFXOsGG))$concordance,
GeneGeneCentred=survConcordance(osTD[!trainIdx]~as.matrix(dataFrameOsTD[!trainIdx,whichRFXOsTDGG]) %*% coef(coxRFXOsGGc))$concordance,
AllInt=survConcordance(osTD[!trainIdx]~as.matrix(dataFrameOsTD[!trainIdx,whichRFXOsTD]) %*% coef(coxRFXOsAll))$concordance,
AllIntCentred=survConcordance(osTD[!trainIdx]~as.matrix(dataFrameOsTD[!trainIdx,whichRFXOsTD]) %*% coef(coxRFXOsAllc))$concordance
))
}, mc.cores=10))
colnames(allModelsCvRfxC) <- sub(".concordant","",colnames(allModelsCvRfxC))
par(mar=c(3,3,1,1),bty="n", mgp=c(2,.5,0), las=2)
r <- sapply(as.data.frame(lapply(as.data.frame(round(t(apply(-allModelsCvRfxC,1,rank)))),factor, levels=1:ncol(allModelsCvRfxC))),table)
o <- order(colMeans(allModelsCvRfxC))
boxplot(allModelsCvRfxC[,o], notch=TRUE, ylab="Concordance", staplewex=0, lty=1, pch=16, xaxt="n")
rotatedLabel(1:ncol(allModelsCvRfxC), rep(par("usr")[3],ncol(allModelsCvRfxC)), colnames(allModelsCvRfxC)[o])
par(mar=c(3,3,3,1), xpd=NA, las=2, mgp=c(2,.5,0))
barplot(r[,o]/replicates, col=set1[c(3,2,4,1,5,7)][1:ncol(allModelsCvRfxC)], ylab="Fraction", names.arg=rep("",ncol(r))) -> b
rotatedLabel(b, rep(par("usr")[3],ncol(allModelsCvRfxC)), colnames(allModelsCvRfxC)[o])
legend(par("usr")[1],1.5, fill=set1[c(3,2,4,1,5,7)][1:ncol(allModelsCvRfxC)], legend=1:ncol(allModelsCvRfxC), bty="n", border=NA, horiz=TRUE, title="Rank")
The following models allow for quantifying the effect of a time-dependent covariate, such as a bone marrow transplant, which is typically administered some time after diagnosis. The subsequent code is executed on our LSF cluster for 100 replicates.
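Such a time-dependent covariate is coded with (start, stop] intervals, so that each transplanted patient contributes a pre-transplant row (covariate 0) and a post-transplant row (covariate 1). A minimal self-contained sketch with simulated data; the variable names and distributions are illustrative only:

```r
## Time-dependent transplant covariate via (start, stop] interval coding.
library(survival)
set.seed(3)
n <- 200
followup <- rexp(n, 1/3)                     # simulated follow-up times
status <- rbinom(n, 1, 0.7)                  # event indicator at end of follow-up
tplTime <- ifelse(runif(n) < 0.3, runif(n) * followup, NA)  # ~30% transplanted
rows <- do.call(rbind, lapply(1:n, function(i) {
  if (is.na(tplTime[i]))                     # never transplanted: one row
    data.frame(start = 0, stop = followup[i], status = status[i], transplant = 0)
  else rbind(                                # transplanted: split at tplTime
    data.frame(start = 0, stop = tplTime[i], status = 0, transplant = 0),
    data.frame(start = tplTime[i], stop = followup[i], status = status[i], transplant = 1))
}))
fit <- coxph(Surv(start, stop, status) ~ transplant, data = rows)
fit
```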
read_chunk('../code/cv100.R', labels="allModelsCVTDCode")
#save(dataFrame, nrdData, crGroups, mainGroups, prdData, relData, prdData, osData, cr, dataFrameOsTD, dataFrame, osTD, tplSplitOs, groups, data, whichRFXOsTDGG, mainIdxOs, clinicalData, MultiRFX5, os, mainIdxOsTD, scope, whichRFXOsGG, file="../../code/cv100.RData")
load("cv100.RData")
library(mg14)
library(CoxHD)
library(Rcpp)
library(randomForestSRC)
jobIndex <- as.numeric(Sys.getenv("LSB_JOBINDEX"))
set.seed(jobIndex)
splits <- sample(1:nrow(dataFrame)%%5 +1 )
trainIdx <- splits!=1 ## sample 1/5
# Static models (other)
c <- coxph(os[trainIdx] ~ 1, data=dataFrame[trainIdx,mainIdxOs])
scope <- c("Genetics","CNA","Treatment","Fusions") ## For CPSS
scopeStep <- as.formula(paste("os[trainIdx] ~", paste(colnames(dataFrame)[mainIdxOs], collapse="+")))
coxBICOs <- step(c, scope=scopeStep, k = log(sum(trainIdx)), trace=0)
coxAICOs <- step(coxBICOs, scope=scopeStep, k = 2, trace=0)
coxCPSSOs <- CoxCPSSInteractions(dataFrame[!is.na(os) & trainIdx, mainIdxOs], na.omit(os[trainIdx]), bootstrap.samples=50, scope = which(groups %in% scope))
coxRFXOs <- CoxRFX(dataFrame[trainIdx,mainIdxOs], os[trainIdx], groups=groups[mainIdxOs])
coxRFXOs$Z <- NULL
coxRFXOsGGc <- CoxRFX(dataFrame[trainIdx,whichRFXOsGG], os[trainIdx], groups=groups[whichRFXOsGG], which.mu=mainGroups)
coxRFXOsGGc$Z <- NULL
rForestOsTrain <- rfsrc(Surv(time, status) ~.,data= cbind(time = os[,1], status = os[,2], dataFrame[,mainIdxOs])[trainIdx,], ntree=100, importance="none")
# Time-dependent models
trainIdxTD <- splits[tplSplitOs]!=1 ## sample 1/5
c <- coxph(osTD[trainIdxTD] ~ 1, data=dataFrameOsTD[trainIdxTD,mainIdxOsTD])
scopeStep <- as.formula(paste("osTD[trainIdx] ~", paste(colnames(dataFrameOsTD)[mainIdxOsTD], collapse="+")))
coxBICOsTD <- step(c, scope=scopeStep, k = log(sum(trainIdxTD)), trace=0)
coxAICOsTD <- step(coxBICOsTD, scope=scopeStep, k = 2, trace=0)
coxRFXOsTD <- CoxRFX(dataFrameOsTD[trainIdxTD,mainIdxOsTD], osTD[trainIdxTD], groups=groups[mainIdxOsTD])
coxRFXOsTD$Z <- NULL
coxRFXOsTDGGc <- CoxRFX(dataFrameOsTD[trainIdxTD,whichRFXOsTDGG], osTD[trainIdxTD], groups=groups[whichRFXOsTDGG], which.mu=mainGroups)
coxRFXOsTDGGc$Z <- NULL
# Multi-stage model
whichTrain <- which(trainIdx)
rfxNrs <- CoxRFX(nrdData[nrdData$index %in% whichTrain, names(crGroups)], Surv(nrdData$time1, nrdData$time2, nrdData$status)[nrdData$index %in% whichTrain], groups=crGroups, nu=ifelse(jobIndex==45,1,0), which.mu = intersect(mainGroups, unique(crGroups))) #avoiding data singularity in split 45
rfxNrs$coefficients["transplantRel"] <- 0
rfxPrs <- CoxRFX(prdData[prdData$index %in% whichTrain, names(crGroups)], Surv(prdData$time1, prdData$time2, prdData$status)[prdData$index %in% whichTrain], groups=crGroups, nu=1, which.mu = intersect(mainGroups, unique(crGroups)))
rfxRel <- CoxRFX(relData[relData$index %in% whichTrain, names(crGroups)], Surv(relData$time1, relData$time2, relData$status)[relData$index %in% whichTrain], groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
rfxRel$coefficients["transplantRel"] <- 0
rfxCr <- CoxRFX(osData[whichTrain, names(crGroups)], Surv(cr[,1], cr[,2]==2)[whichTrain], groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)), nu=ifelse(jobIndex==55,1,0)) #avoiding data singularity in split 55
rfxEs <- CoxRFX(osData[whichTrain, names(crGroups)], Surv(cr[,1], cr[,2]==1)[whichTrain], groups=crGroups, which.mu = NULL)
ix <- tplSplitOs %in% whichTrain
rfxOs <- CoxRFX(dataFrameOsTD[ix,whichRFXOsTDGG], osTD[ix], groups[whichRFXOsTDGG], which.mu=mainGroups) ## allow only the main groups to have mean different from zero..
xx <- 0:2000
coxphPrs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(prdData, time0=as.numeric(clinicalData$Recurrence_date-clinicalData$CR_date)[prdData$index])[prdData$index %in% whichTrain,])
tdPrmBaseline <- exp(predict(coxphPrs, newdata=data.frame(time0=xx[-1])))
coxphOs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(osData, time0=pmin(500,cr[osData$index,1]))[osData$index %in% whichTrain,])
tdOsBaseline <- exp(pmin(predict(coxphOs, newdata=data.frame(time0=500)),predict(coxphOs, newdata=data.frame(time0=xx[-1])))) ## cap predictions at induction length 500 days.
dataTD <- data[tplSplitOs, ]
dataTD$transplantCR1[1:nrow(data)] <- 0
dataTD$transplantRel[1:nrow(data)] <- 0
multiRfx5 <- MultiRFX5(rfxEs, rfxCr, rfxNrs, rfxRel, rfxPrs, dataTD[!trainIdxTD,], tdPrmBaseline = tdPrmBaseline, tdOsBaseline = tdOsBaseline, x=2000)
save(rfxEs, rfxCr, rfxNrs, rfxPrs, rfxRel, rfxOs, multiRfx5, coxBICOs, coxBICOsTD, coxAICOs, coxAICOsTD, coxCPSSOs, coxRFXOs, coxRFXOsTD, coxRFXOsGGc, coxRFXOsTDGGc, rForestOsTrain, file=paste0("cv100/",jobIndex,".RData"))
Gathering results and computing predictions
replicates <- 100
allModelsCvTdPredictions <- mclapply(1:replicates, function(foo) try({
e <- new.env()
load(paste0("../code/cv100/",foo,".RData"), envir=e)
set.seed(foo)
x <- list(
BIC=e$coxBICOsTD,
AIC=e$coxAICOsTD,
RFX=e$coxRFXOsTD,
RFXgg=e$coxRFXOsTDGGc
)
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )[tplSplitOs]!=1 ## sample 1/5
pred <- cbind(ELN=c(4,1,3,2)[clinicalData$M_Risk[tplSplitOs][!trainIdx]],
sapply(x, function(y){
predictAllModels(y, newdata=dataFrameOsTD[!trainIdx,])
}))
pred <- cbind(pred, mRFX1yr=colSums(e$multiRfx5[365,1:3,]), mRFX3yr=colSums(e$multiRfx5[3*365,1:3,]), mRFX5yr=colSums(e$multiRfx5[5*365,1:3,]))
return(pred)
}), mc.cores=4)
Harrell’s C
allModelsCvTdC <- sapply(1:replicates, function(foo){
x <- allModelsCvTdPredictions[[foo]]
set.seed(foo)
trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )[tplSplitOs]!=1 ## sample 1/5
apply(x, 2 , function(p){
survConcordance(osYrTD[!trainIdx,] ~ p)$concordance
})
})
apply(allModelsCvTdC,1,quantile)
## ELN BIC AIC RFX RFXgg mRFX1yr mRFX3yr mRFX5yr
## 0% 0.5779764 0.6475790 0.6462563 0.6617532 0.6662437 0.6686109 0.6729061 0.6733210
## 25% 0.6199073 0.6939107 0.6910616 0.7015326 0.7070262 0.6937961 0.7050696 0.7050183
## 50% 0.6369861 0.7048814 0.7046267 0.7175752 0.7211409 0.7070192 0.7178056 0.7178882
## 75% 0.6481770 0.7196140 0.7164776 0.7291661 0.7329534 0.7215234 0.7331457 0.7326754
## 100% 0.6805547 0.7492823 0.7440624 0.7633471 0.7621882 0.7538965 0.7604133 0.7600605
par(mar=c(3,3,1,1),bty="n", mgp=c(2,.5,0), las=2)
r <- sapply(as.data.frame(lapply(as.data.frame(t(apply(-allModelsCvTdC,2,rank))),factor, levels=1:nrow(allModelsCvTdC))),table)
o <- order(apply(allModelsCvTdC,1,median))
boxplot(t(allModelsCvTdC[o,]), notch=TRUE, ylab="Concordance", staplewex=0, lty=1, pch=16, xaxt="n")
rotatedLabel(1:nrow(allModelsCvTdC), rep(par("usr")[3],nrow(allModelsCvTdC)), rownames(allModelsCvTdC)[o])
par(mar=c(3,3,3,1), xpd=NA, las=2, mgp=c(2,.5,0))
clr <- brewer.pal(nrow(allModelsCvTdC),"PiYG")#set1[c(3,2,4,1,5,7)]
barplot(r[,o]/replicates, col=clr[1:ncol(allModelsCvTdC)], ylab="Fraction", names.arg=rep("",ncol(r))) -> b
rotatedLabel(b, rep(par("usr")[3],ncol(allModelsCvTdC)), colnames(allModelsCvTdC)[o])
legend(par("usr")[1],1.5, fill=clr[1:nrow(allModelsCvTdC)], legend=1:nrow(allModelsCvTdC), bty="n", border=NA, horiz=TRUE, title="Rank")
allModelsTrial <- mclapply(levels(clinicalData$Study), function(foo){
#set.seed(foo)
trainIdx <- clinicalData$Study != foo
c <- coxph(os[trainIdx] ~ 1, data=dataFrame[trainIdx,mainIdxOs])
scopeStep <- as.formula(paste("os[trainIdx] ~", paste(colnames(dataFrame)[mainIdxOs], collapse="+")))
coxBICOsTrain <- step(c, scope=scopeStep, k = log(sum(trainIdx)), trace=0)
coxAICOsTrain <- step(coxBICOsTrain, scope=scopeStep, k = 2, trace=0)
coxCPSSOsTrain <- CoxCPSSInteractions(dataFrame[!is.na(os) & trainIdx, mainIdxOs], na.omit(os[trainIdx]), bootstrap.samples=50, scope = which(groups %in% scope))
w <- colnames(dataFrame[mainIdxOs])
w <- setdiff(w, names(which(colSums(dataFrame[trainIdx,w])==0)))
coxRFXOsTrain <- CoxRFX(dataFrame[trainIdx,w], os[trainIdx], groups=groups[w], nu = if(foo=="AMLSG0704") 1 else 0) # add prior for 0704 (just one group member)
coxRFXOsTrain$Z <- NULL
w <- whichRFXOsGG
w <- setdiff(w, which(colSums(dataFrame[trainIdx,w])==0))
coxRFXOsGGc <- CoxRFX(dataFrame[trainIdx,w], os[trainIdx], groups=groups[w], which.mu=mainGroups, nu = if(foo=="AMLSG0704") 1 else 0)
coxRFXOsGGc$Z <- NULL
rForestOsTrain <- rfsrc(Surv(time, status) ~.,data= cbind(time = os[,1], status = os[,2], dataFrame[,mainIdxOs])[trainIdx,], ntree=100, importance="none")
return(list(
BIC=coxBICOsTrain,
AIC=coxAICOsTrain,
CPSS=coxCPSSOsTrain,
RFX=coxRFXOsTrain,
RFXgg=coxRFXOsGGc,
rForest=rForestOsTrain
))
}, mc.cores=3)
names(allModelsTrial) <- levels(clinicalData$Study)
allModelsTrialPredictions <- mclapply(names(allModelsTrial), function(foo){
x <- allModelsTrial[[foo]]
trainIdx <- clinicalData$Study != foo
cbind(ELN=c(4,1,3,2)[clinicalData$M_Risk[!trainIdx]],
sapply(x, function(y){
predictAllModels(y, newdata=dataFrame[!trainIdx,])
}))
}, mc.cores=10)
names(allModelsTrialPredictions) <- names(allModelsTrial)
allModelsTrialC <- sapply(names(allModelsTrial), function(foo){
trainIdx <- clinicalData$Study != foo
apply(allModelsTrialPredictions[[foo]], 2 , function(p){
unlist( survConcordance(osYr[!trainIdx,] ~ p)[c("concordance","std.err")])
})
}, simplify="array")
allModelsTrialC
## , , AMLHD98A
##
## ELN BIC AIC CPSS RFX RFXgg rForest
## concordance.concordant 0.65602221 0.65776049 0.67425763 0.65551738 0.69185458 0.69107788 0.66051312
## std.err.std(c-d) 0.01533222 0.01554614 0.01554614 0.01554614 0.01554614 0.01554614 0.01554614
##
## , , AMLHD98B
##
## ELN BIC AIC CPSS RFX RFXgg rForest
## concordance.concordant 0.60724349 0.67239030 0.67552467 0.68635868 0.68138457 0.66578087 0.64172799
## std.err.std(c-d) 0.02607456 0.02595602 0.02595607 0.02595607 0.02595607 0.02595607 0.02595607
##
## , , AMLSG0704
##
## ELN BIC AIC CPSS RFX RFXgg rForest
## concordance.concordant 0.63806151 0.66111737 0.65393399 0.66226428 0.68447415 0.68294156 0.66402777
## std.err.std(c-d) 0.01552632 0.01550206 0.01550206 0.01550206 0.01550206 0.01550206 0.01550206
allModelsTrialTD <- mclapply(levels(clinicalData$Study), function(foo){
#set.seed(foo)
trainIdxTD <- clinicalData$Study[tplSplitOs] != foo
c <- coxph(osTD[trainIdxTD] ~ 1, data=dataFrameOsTD[trainIdxTD,mainIdxOsTD])
scopeStep <- as.formula(paste("osTD[trainIdx] ~", paste(colnames(dataFrameOsTD)[mainIdxOsTD], collapse="+")))
coxBICOsTrain <- step(c, scope=scopeStep, k = log(sum(trainIdxTD)), trace=0)
coxAICOsTrain <- step(coxBICOsTrain, scope=scopeStep, k = 2, trace=0)
coxRFXOsTrain <- CoxRFX(dataFrameOsTD[trainIdxTD,mainIdxOsTD], osTD[trainIdxTD], groups=groups[mainIdxOsTD], nu = if(foo=="AMLSG0704") 1 else 0)
coxRFXOsTrain$Z <- NULL
coxRFXOsGGc <- CoxRFX(dataFrameOsTD[trainIdxTD,whichRFXOsTDGG], osTD[trainIdxTD], groups=groups[whichRFXOsTDGG], which.mu=mainGroups, nu = if(foo=="AMLSG0704") 1 else 0)
coxRFXOsGGc$Z <- NULL
return(list(
BIC=coxBICOsTrain,
AIC=coxAICOsTrain,
RFX=coxRFXOsTrain,
RFXgg=coxRFXOsGGc ))
}, mc.cores=3)
names(allModelsTrialTD) <- levels(clinicalData$Study)
allModelsTrialTdPredictions <- mclapply(names(allModelsTrialTD), function(foo){
x <- allModelsTrialTD[[foo]]
trainIdxTD <<- clinicalData$Study[tplSplitOs] != foo
pred <- cbind(ELN=c(4,1,3,2)[clinicalData$M_Risk[tplSplitOs][!trainIdxTD]],
sapply(x, function(y){
predictAllModels(y, newdata=dataFrameOsTD[!trainIdxTD,])
}))
whichTrain <<- which(trainIdxTD[1:nrow(dataFrame)])
rfxNrs <- CoxRFX(nrdData[nrdData$index %in% whichTrain, names(crGroups)], Surv(nrdData$time1, nrdData$time2, nrdData$status)[nrdData$index %in% whichTrain], groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
rfxNrs$coefficients["transplantRel"] <- 0
rfxPrs <- CoxRFX(prdData[prdData$index %in% whichTrain, names(crGroups)], Surv(prdData$time1, prdData$time2, prdData$status)[prdData$index %in% whichTrain], groups=crGroups, nu=1, which.mu = intersect(mainGroups, unique(crGroups)))
rfxRel <- CoxRFX(relData[relData$index %in% whichTrain, names(crGroups)], Surv(relData$time1, relData$time2, relData$status)[relData$index %in% whichTrain], groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
rfxRel$coefficients["transplantRel"] <- 0
rfxCr <- CoxRFX(osData[whichTrain, names(crGroups)], Surv(cr[,1], cr[,2]==2)[whichTrain], groups=crGroups, which.mu = NULL)#intersect(mainGroups, unique(crGroups)))
rfxEs <- CoxRFX(osData[whichTrain, names(crGroups)], Surv(cr[,1], cr[,2]==1)[whichTrain], groups=crGroups, which.mu = NULL)
ix <- tplSplitOs %in% whichTrain
rfxOs <- CoxRFX(dataFrameOsTD[ix,whichRFXOsTDGG], osTD[ix], groups[whichRFXOsTDGG], which.mu=mainGroups) ## allow only the main groups to have mean different from zero..
xx <- 0:2000
coxphPrs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(prdData, time0=as.numeric(clinicalData$Recurrence_date-clinicalData$CR_date)[prdData$index])[prdData$index %in% whichTrain,])
tdPrmBaseline <- exp(predict(coxphPrs, newdata=data.frame(time0=xx[-1])))
coxphOs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(osData, time0=pmin(500,cr[osData$index,1]))[osData$index %in% whichTrain,])
tdOsBaseline <- exp(pmin(predict(coxphOs, newdata=data.frame(time0=500)),predict(coxphOs, newdata=data.frame(time0=xx[-1])))) ## cap predictions at induction length 500 days.
dataTD <- data[tplSplitOs, ]
dataTD$transplantCR1[1:nrow(data)] <- 0
dataTD$transplantRel[1:nrow(data)] <- 0
multiRfx5 <- MultiRFX5(rfxEs, rfxCr, rfxNrs, rfxRel, rfxPrs, dataTD[!trainIdxTD,], tdPrmBaseline = tdPrmBaseline, tdOsBaseline = tdOsBaseline, x=2000)
pred <- cbind(pred, mRFX1yr=colSums(multiRfx5[365,1:3,]), mRFX3yr=colSums(multiRfx5[3*365,1:3,]), mRFX5yr=colSums(multiRfx5[5*365,1:3,]))
return(pred)
}, mc.cores=3)
names(allModelsTrialTdPredictions) <- names(allModelsTrialTD)
allModelsTrialTdC <- sapply(names(allModelsTrialTD), function(foo){
trainIdx <- clinicalData$Study[tplSplitOs] != foo
apply(allModelsTrialTdPredictions[[foo]], 2 , function(p){
unlist( survConcordance(osYrTD[!trainIdx,] ~ p)[c("concordance","std.err")])
})
}, simplify="array")
allModelsTrialTdC
## , , AMLHD98A
##
## ELN BIC AIC RFX RFXgg mRFX1yr mRFX3yr mRFX5yr
## concordance.concordant 0.64679319 0.67046248 0.68279982 0.70245030 0.70055621 0.70234158 0.70868764 0.70878492
## std.err.std(c-d) 0.01579591 0.01588897 0.01588897 0.01588897 0.01588897 0.01588897 0.01588897 0.01588897
##
## , , AMLHD98B
##
## ELN BIC AIC RFX RFXgg mRFX1yr mRFX3yr mRFX5yr
## concordance.concordant 0.60665156 0.68014926 0.66209102 0.68314786 0.66875458 0.67142000 0.67495169 0.67088692
## std.err.std(c-d) 0.02622793 0.02608867 0.02608872 0.02608872 0.02608872 0.02608872 0.02608872 0.02608872
##
## , , AMLSG0704
##
## ELN BIC AIC RFX RFXgg mRFX1yr mRFX3yr mRFX5yr
## concordance.concordant 0.63467885 0.67383609 0.66879788 0.68775965 0.68597782 0.68237127 0.69286694 0.69053722
## std.err.std(c-d) 0.01583684 0.01581224 0.01581224 0.01581224 0.01581224 0.01581224 0.01581224 0.01581224
Fit a single tree (T. Therneau, Atkinson, and Ripley 2014) and a random forest model (Ishwaran et al. 2008).
library(rpart)
library(randomForestSRC)
tree <- rpart(os ~ ., data=dataFrame[mainIdxOs & osIdx])
plot(tree)
text(tree)
survConcordance(na.omit(os)~predict(tree))
## Call:
## survConcordance(formula = na.omit(os) ~ predict(tree))
##
## n= 1540
## Concordance= 0.636895 se= 0.00915092
## concordant discordant tied.risk tied.time std(c-d)
## 466728.00 206536.00 277070.00 432.00 17392.86
Random forest
rForest <- rfsrc(Surv(time, status) ~.,data= cbind(time = os[,1], status = os[,2], dataFrame[,mainIdxOs & osIdx]), ntree=100)
boxplot(rForest$importance ~ droplevels(groups[mainIdxOs & osIdx]), border= colGroups[mainGroups], staplewex=0, pch=16, cex=0.75, ylab="RSF importance", lty=1, xaxt="n")
rotatedLabel(labels=mainGroups)
rForestVimp <- sapply(mainGroups, function(g) vimp(rForest, colnames(dataFrame)[which(groups==g)]))
survConcordance(na.omit(os)~predict(rForest, importance="none")$predicted)
## Call:
## survConcordance(formula = na.omit(os) ~ predict(rForest, importance = "none")$predicted)
##
## n= 1540
## Concordance= 0.8960502 se= 0.01003199
## concordant discordant tied.risk tied.time std(c-d)
## 851547.00 98787.00 0.00 432.00 19067.49
Complementary pairs stability selection with interaction terms
set.seed(42)
coxCPSSIntOs <- CoxCPSSInteractions(dataFrame[!is.na(os),groups %in% mainGroups & osIdx], na.omit(os), bootstrap.samples=50, scope = which(groups %in% scope))
## ..................................................
## ..................................................
selectedIntOs <- names(which(coxCPSSIntOs$Pi > 0.8))
coxCPSSIntOs
##
## Stability selection:
## Variable P[select] P-value adj. P
## NPM1 0.98 0.00121 0.0155
## TP53 0.98 0.00121 0.0155
## CEBPA_bi 0.92 0.00413 0.0327
## FLT3_ITD 1.00 0.00038 0.0099
## minus7 0.92 0.00413 0.0327
## t_15_17 0.97 0.00164 0.0188
## inv16_t16_16 0.98 0.00121 0.0155
## complex 1.00 0.00038 0.0099
## Date_1000 1.00 0.00079 0.0155
## ATRA 0.84 0.00084 0.0874
## AOD_10 1.00 0.00038 0.0099
## Performance_ECOG 0.96 0.00210 0.0216
## wbc_100 1.00 0.00038 0.0099
## LDH_1000 0.94 0.00307 0.0287
##
## Corresponding coxph:
## Call:
## coxph(formula = na.omit(os) ~ NPM1 + TP53 + CEBPA_bi + FLT3_ITD +
## minus7 + t_15_17 + inv16_t16_16 + complex + Date_1000 + ATRA +
## AOD_10 + Performance_ECOG + wbc_100 + LDH_1000, data = dataFrame[!is.na(os),
## groups %in% mainGroups & osIdx])
##
##
## coef exp(coef) se(coef) z p
## NPM1 -0.4847 0.616 0.0830 -5.84 5.3e-09
## TP53 0.7601 2.139 0.1426 5.33 9.7e-08
## CEBPA_bi -0.9735 0.378 0.2196 -4.43 9.3e-06
## FLT3_ITD 0.4619 1.587 0.0856 5.39 6.9e-08
## minus7 0.4963 1.643 0.1333 3.72 2.0e-04
## t_15_17 -1.2482 0.287 0.2466 -5.06 4.2e-07
## inv16_t16_16 -1.2180 0.296 0.2334 -5.22 1.8e-07
## complex 0.5532 1.739 0.1214 4.56 5.1e-06
## Date_1000 -0.1510 0.860 0.0311 -4.86 1.2e-06
## ATRA -0.1936 0.824 0.0801 -2.42 1.6e-02
## AOD_10 0.3062 1.358 0.0318 9.63 0.0e+00
## Performance_ECOG 0.1115 1.118 0.0542 2.06 4.0e-02
## wbc_100 0.4852 1.625 0.0696 6.97 3.2e-12
## LDH_1000 0.0935 1.098 0.0527 1.77 7.6e-02
##
## Likelihood ratio test=509 on 14 df, p=0 n= 1540, number of events= 904
Stepwise model selection by BIC
c <- coxph(os ~ 1, data=dataFrame[,mainIdxOs & osIdx])
scopeStep <- as.formula(paste("os ~", paste(colnames(dataFrame)[mainIdxOs& osIdx], collapse="+")))
coxBICOs <- step(c, scope=scopeStep, k = log(sum(trainIdx)), trace=0)
summary(coxBICOs)
## Call:
## coxph(formula = os ~ AOD_10 + complex + wbc_100 + TP53 + inv16_t16_16 +
## Date_1000 + t_15_17 + CEBPA_bi + NPM1 + FLT3_ITD + inv3_t3_3 +
## SFRS2, data = dataFrame[, mainIdxOs & osIdx])
##
## n= 1540, number of events= 904
##
## coef exp(coef) se(coef) z Pr(>|z|)
## AOD_10 0.27345 1.31449 0.03075 8.892 < 2e-16 ***
## complex 0.72285 2.06029 0.11770 6.142 8.17e-10 ***
## wbc_100 0.50452 1.65618 0.06377 7.912 2.55e-15 ***
## TP53 0.82568 2.28344 0.13893 5.943 2.80e-09 ***
## inv16_t16_16 -1.16935 0.31057 0.23335 -5.011 5.41e-07 ***
## Date_1000 -0.17136 0.84252 0.02938 -5.833 5.45e-09 ***
## t_15_17 -1.20870 0.29859 0.24607 -4.912 9.01e-07 ***
## CEBPA_bi -0.93085 0.39422 0.21984 -4.234 2.29e-05 ***
## NPM1 -0.45461 0.63470 0.08310 -5.471 4.48e-08 ***
## FLT3_ITD 0.51234 1.66919 0.08554 5.990 2.10e-09 ***
## inv3_t3_3 1.14923 3.15576 0.22911 5.016 5.27e-07 ***
## SFRS2 0.38488 1.46943 0.12731 3.023 0.0025 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## AOD_10 1.3145 0.7608 1.2376 1.3962
## complex 2.0603 0.4854 1.6359 2.5948
## wbc_100 1.6562 0.6038 1.4616 1.8767
## TP53 2.2834 0.4379 1.7391 2.9981
## inv16_t16_16 0.3106 3.2199 0.1966 0.4907
## Date_1000 0.8425 1.1869 0.7954 0.8925
## t_15_17 0.2986 3.3491 0.1843 0.4836
## CEBPA_bi 0.3942 2.5367 0.2562 0.6066
## NPM1 0.6347 1.5756 0.5393 0.7470
## FLT3_ITD 1.6692 0.5991 1.4115 1.9739
## inv3_t3_3 3.1558 0.3169 2.0141 4.9445
## SFRS2 1.4694 0.6805 1.1449 1.8859
##
## Concordance= 0.707 (se = 0.01 )
## Rsquare= 0.282 (max possible= 1 )
## Likelihood ratio test= 509.3 on 12 df, p=0
## Wald test = 528.4 on 12 df, p=0
## Score (logrank) test = 589.2 on 12 df, p=0
With AIC
coxAICOs <- step(c, scope= scopeStep, k = 2, trace=0)
summary(coxAICOs)
## Call:
## coxph(formula = os ~ AOD_10 + complex + wbc_100 + TP53 + inv16_t16_16 +
## Date_1000 + t_15_17 + CEBPA_bi + NPM1 + FLT3_ITD + inv3_t3_3 +
## SFRS2 + ATRA + BM_Blasts_100 + IDH2_p172 + CEBPA_mono + RAD21 +
## Performance_ECOG + KRAS + sAML + minus7 + VPA + MissingCyto +
## BRAF + HB_10 + platelet_100 + DNMT3A + ZRSR2 + plus21 + ASXL1 +
## FLT3_other + IDH1 + plus22 + mono17_17p_abn17p + PTEN + t_v_11 +
## minus18_18q, data = dataFrame[, mainIdxOs & osIdx])
##
## n= 1540, number of events= 904
##
## coef exp(coef) se(coef) z Pr(>|z|)
## AOD_10 2.917e-01 1.339e+00 3.274e-02 8.907 < 2e-16 ***
## complex 5.447e-01 1.724e+00 1.357e-01 4.015 5.95e-05 ***
## wbc_100 4.113e-01 1.509e+00 6.786e-02 6.060 1.36e-09 ***
## TP53 7.884e-01 2.200e+00 1.552e-01 5.079 3.79e-07 ***
## inv16_t16_16 -1.353e+00 2.583e-01 2.520e-01 -5.370 7.86e-08 ***
## Date_1000 -1.730e-01 8.412e-01 3.205e-02 -5.396 6.80e-08 ***
## t_15_17 -1.321e+00 2.669e-01 2.544e-01 -5.192 2.08e-07 ***
## CEBPA_bi -1.067e+00 3.442e-01 2.251e-01 -4.739 2.15e-06 ***
## NPM1 -5.264e-01 5.907e-01 9.188e-02 -5.730 1.01e-08 ***
## FLT3_ITD 5.475e-01 1.729e+00 8.774e-02 6.240 4.37e-10 ***
## inv3_t3_3 1.033e+00 2.809e+00 2.686e-01 3.845 0.00012 ***
## SFRS2 3.922e-01 1.480e+00 1.303e-01 3.009 0.00262 **
## ATRA -2.122e-01 8.088e-01 8.106e-02 -2.618 0.00885 **
## BM_Blasts_100 4.422e-01 1.556e+00 1.516e-01 2.918 0.00353 **
## IDH2_p172 -5.463e-01 5.791e-01 2.516e-01 -2.171 0.02989 *
## CEBPA_mono -4.348e-01 6.474e-01 1.946e-01 -2.235 0.02545 *
## RAD21 -3.781e-01 6.851e-01 2.196e-01 -1.722 0.08510 .
## Performance_ECOG 1.542e-01 1.167e+00 5.599e-02 2.753 0.00590 **
## KRAS 3.733e-01 1.453e+00 1.555e-01 2.400 0.01639 *
## sAML 3.078e-01 1.360e+00 1.583e-01 1.945 0.05181 .
## minus7 3.358e-01 1.399e+00 1.537e-01 2.185 0.02892 *
## VPA 3.359e-01 1.399e+00 1.267e-01 2.652 0.00801 **
## MissingCyto 2.543e-01 1.290e+00 1.173e-01 2.168 0.03015 *
## BRAF 9.975e-01 2.711e+00 3.848e-01 2.592 0.00954 **
## HB_10 3.796e-01 1.462e+00 1.839e-01 2.064 0.03906 *
## platelet_100 -9.481e-02 9.095e-01 4.111e-02 -2.306 0.02111 *
## DNMT3A 1.672e-01 1.182e+00 8.820e-02 1.896 0.05803 .
## ZRSR2 7.085e-01 2.031e+00 3.138e-01 2.258 0.02395 *
## plus21 3.840e-01 1.468e+00 2.065e-01 1.859 0.06298 .
## ASXL1 2.545e-01 1.290e+00 1.429e-01 1.781 0.07485 .
## FLT3_other -2.877e-01 7.500e-01 1.680e-01 -1.712 0.08683 .
## IDH1 2.226e-01 1.249e+00 1.310e-01 1.699 0.08938 .
## plus22 4.735e-01 1.606e+00 2.875e-01 1.647 0.09960 .
## mono17_17p_abn17p 3.082e-01 1.361e+00 1.763e-01 1.748 0.08046 .
## PTEN -1.359e+01 1.249e-06 8.101e+02 -0.017 0.98661
## t_v_11 3.122e-01 1.366e+00 1.924e-01 1.623 0.10466
## minus18_18q -3.497e-01 7.049e-01 2.363e-01 -1.480 0.13887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## AOD_10 1.339e+00 7.470e-01 1.2554 1.4274
## complex 1.724e+00 5.800e-01 1.3216 2.2494
## wbc_100 1.509e+00 6.628e-01 1.3208 1.7233
## TP53 2.200e+00 4.546e-01 1.6228 2.9819
## inv16_t16_16 2.583e-01 3.871e+00 0.1576 0.4234
## Date_1000 8.412e-01 1.189e+00 0.7900 0.8957
## t_15_17 2.669e-01 3.747e+00 0.1621 0.4394
## CEBPA_bi 3.442e-01 2.905e+00 0.2214 0.5350
## NPM1 5.907e-01 1.693e+00 0.4934 0.7073
## FLT3_ITD 1.729e+00 5.784e-01 1.4558 2.0533
## inv3_t3_3 2.809e+00 3.560e-01 1.6593 4.7552
## SFRS2 1.480e+00 6.756e-01 1.1465 1.9110
## ATRA 8.088e-01 1.236e+00 0.6900 0.9481
## BM_Blasts_100 1.556e+00 6.426e-01 1.1562 2.0945
## IDH2_p172 5.791e-01 1.727e+00 0.3537 0.9482
## CEBPA_mono 6.474e-01 1.545e+00 0.4421 0.9480
## RAD21 6.851e-01 1.460e+00 0.4455 1.0537
## Performance_ECOG 1.167e+00 8.571e-01 1.0454 1.3020
## KRAS 1.453e+00 6.885e-01 1.0708 1.9702
## sAML 1.360e+00 7.351e-01 0.9976 1.8552
## minus7 1.399e+00 7.148e-01 1.0351 1.8908
## VPA 1.399e+00 7.147e-01 1.0916 1.7935
## MissingCyto 1.290e+00 7.754e-01 1.0247 1.6229
## BRAF 2.711e+00 3.688e-01 1.2754 5.7645
## HB_10 1.462e+00 6.842e-01 1.0192 2.0961
## platelet_100 9.095e-01 1.099e+00 0.8391 0.9859
## DNMT3A 1.182e+00 8.460e-01 0.9943 1.4050
## ZRSR2 2.031e+00 4.924e-01 1.0980 3.7566
## plus21 1.468e+00 6.811e-01 0.9794 2.2008
## ASXL1 1.290e+00 7.753e-01 0.9748 1.7068
## FLT3_other 7.500e-01 1.333e+00 0.5396 1.0425
## IDH1 1.249e+00 8.005e-01 0.9663 1.6150
## plus22 1.606e+00 6.228e-01 0.9139 2.8206
## mono17_17p_abn17p 1.361e+00 7.348e-01 0.9633 1.9228
## PTEN 1.249e-06 8.008e+05 0.0000 Inf
## t_v_11 1.366e+00 7.318e-01 0.9372 1.9924
## minus18_18q 7.049e-01 1.419e+00 0.4436 1.1201
##
## Concordance= 0.724 (se = 0.01 )
## Rsquare= 0.329 (max possible= 1 )
## Likelihood ratio test= 613.5 on 37 df, p=0
## Wald test = 608.7 on 37 df, p=0
## Score (logrank) test = 685.1 on 37 df, p=0
Time-dependent AIC and BIC, including allografts
c <- coxph(osTD ~ 1, data=dataFrameOsTD[mainIdxOsTD])
scopeStep <- as.formula(paste("osTD ~", paste(colnames(dataFrameOsTD)[mainIdxOsTD], collapse="+")))
coxBICOsTD <- step(c, scope=scopeStep, k = log(nrow(dataFrame)), trace=0)
coxAICOsTD <- step(coxBICOsTD, scope=scopeStep, k = 2, trace=0)
Load data
tcgaClinical <- read.table("../data/TCGA_Clinical.txt", sep="\t", header=TRUE)
tcgaGenetic <- read.table("../data/TCGA_Genetic.txt", sep="\t", header=TRUE)
tcgaGenetic$TCGA_ID <- factor(as.character(tcgaGenetic$TCGA_ID), levels = levels(tcgaClinical$TCGA_ID))
g <- as.character(tcgaGenetic$Hugo_Symbol)
g[tcgaGenetic$Hugo_Symbol=="FLT3" & tcgaGenetic$Variant_Type == 'INS'] <- "FLT3_ITD"
g[tcgaGenetic$Hugo_Symbol=="FLT3" & tcgaGenetic$Variant_Type == 'SNP'] <- "FLT3_TKD"
tcgaMutation <- (table(tcgaGenetic$TCGA_ID,g)) + 0
t <- data.frame(tcgaMutation[,]>0, CEBPA_mono = tcgaMutation[,"CEBPA"]==1,CEBPA_bi = tcgaMutation[,"CEBPA"]>1,tcgaClinical[,-c(1,2,4,5,6,13,25)], MakeInteger(tcgaClinical$TypeAML)) + 0
w <- grep("_10+$",colnames(dataFrame), value=TRUE)
f <- as.numeric(sub(".+_","",w))
n <- sub("_10+","",w)
f <- f[n %in% colnames(tcgaClinical)]
n <- n[n %in% colnames(tcgaClinical)]
t[n] <- t[n] / rep(f, each=nrow(t))
colnames(t)[match(n,colnames(t))] <- paste(n,f,sep="_")
rm(w,n,f,g)
tcgaData <- dataFrame[1:nrow(t),]
tcgaData[,] <- NA
w <- intersect(names(t), names(tcgaData))
tcgaData[w] <- t[w]
tcgaData$TPL_os <- NA
tcgaData[groups=="Genetics"][is.na(tcgaData[groups=="Genetics"])] <- 0
tcgaData$MissingCyto <- (tcgaClinical$karyotype == '[Not Available]' )+0
rm(t,w)
tcgaSurvival <- Surv(tcgaClinical$OS/365, tcgaClinical$Status)
tb <- read.xlsx("../data/TCGA_SupplementalTable01.xlsx", 1, colIndex=1:29)
tb <- tb[order(tb$TCGA.Patient.ID),]
tt <- strsplit(as.character(tb$Trnsplt), ", ")
tp <- strsplit(as.character(tb$Dz.Stat....trnsplt),", ")
tcgaTpl <- t(sapply(1:nrow(tb) , function(i){
transplantCR1=0; transplantRel=0
if(!identical(tt[[i]], "0")) { ## use [[ ]] to extract the character vector from the strsplit list
a <- tt[[i]] %in% c("MUD","sib Allo") & !tp[[i]] %in% c("Refr dz","refr dz","refr dz post induction","xxxxx","aplastic post chemo","0")
if(any(a)){
if(any(a & tp[[i]] %in% c("CR1","CR 1"))) transplantCR1 <- 1
if(any(a & !tp[[i]] %in% c("CR1","CR 1"))) transplantRel <- 1
}
}
return(c(transplantCR1=transplantCR1, transplantRel=transplantRel))
}) )
tcgaData$TPL_os <- tcgaTpl[,"transplantCR1"]
Plot mutation frequencies
plot(colMeans(dataFrame[groups=="Genetics"])[colnames(tcgaMutation)], colMeans(tcgaMutation))
text(colMeans(dataFrame[groups=="Genetics"])[colnames(tcgaMutation)], colMeans(tcgaMutation), colnames(tcgaMutation))
cor(colMeans(dataFrame[groups=="Genetics"])[colnames(tcgaMutation)], colMeans(tcgaMutation), use='c')
## [1] 0.9320114
NPM1 survival
plot(survfit(tcgaSurvival ~ NPM1, data=tcgaData), col=set1[1:2])
lines(survfit(osYr ~ NPM1, data=dataFrame), col=set1, lty=3,mark=NA)
legend("topright", col=c(set1[1:2],"black","black"), c("NPM1 wt", "NPM1 mut","TCGA","AML"), lty=c(1,1,1,3), bty='n')
CoxRFX model and covariance-based imputation
tcgaRiskRFXOs <- PredictRiskMissing(coxRFXFitOsTDGGc, tcgaData[whichRFXOsTDGG])
survConcordance(tcgaSurvival ~ tcgaRiskRFXOs[,1])
## Call:
## survConcordance(formula = tcgaSurvival ~ tcgaRiskRFXOs[, 1])
##
## n=186 (14 observations deleted due to missingness)
## Concordance= 0.7010229 se= 0.02968944
## concordant discordant tied.risk tied.time std(c-d)
## 9046.0000 3858.0000 0.0000 162.0000 766.2249
CPSS model
tcgaDataImputed <- as.data.frame(ImputeMissing(dataFrame[mainIdxOs], newX=tcgaData[mainIdxOs]))
tcgaRiskCPSSOs <- predict(coxCPSSIntOs, newdata=tcgaDataImputed)
survConcordance(tcgaSurvival ~ tcgaRiskCPSSOs)
## Call:
## survConcordance(formula = tcgaSurvival ~ tcgaRiskCPSSOs)
##
## n=186 (14 observations deleted due to missingness)
## Concordance= 0.691646 se= 0.02968944
## concordant discordant tied.risk tied.time std(c-d)
## 8925.0000 3979.0000 0.0000 162.0000 766.2249
Blind imputation (mean only)
f <- function(X) {X <- sapply(X, poorMansImpute);X[is.na(X)] <- 0; X}
survConcordance(tcgaSurvival ~ predict(coxCPSSIntOs, newdata=as.data.frame(f(tcgaData[mainIdxOs]))))
## Call:
## survConcordance(formula = tcgaSurvival ~ predict(coxCPSSIntOs,
## newdata = as.data.frame(f(tcgaData[mainIdxOs]))))
##
## n=186 (14 observations deleted due to missingness)
## Concordance= 0.687035 se= 0.02968941
## concordant discordant tied.risk tied.time std(c-d)
## 8865.0000 4038.0000 1.0000 162.0000 766.2243
Cytogenetic risk
survConcordance(tcgaSurvival ~ c(3,1,2)[tcgaClinical$C_Risk])
## Call:
## survConcordance(formula = tcgaSurvival ~ c(3, 1, 2)[tcgaClinical$C_Risk])
##
## n=183 (17 observations deleted due to missingness)
## Concordance= 0.5996288 se= 0.02657258
## concordant discordant tied.risk tied.time std(c-d)
## 4741.0000 2272.0000 5378.0000 150.0000 658.5218
PINA score (Pastore et al. 2014) for NK AML.
PINAOs <- function(X){
coef <- c( NPM1=-1.2,
FLT3_ITD=-.26,
`NPM1:FLT3_ITD`=.89,
CEBPA_bi=-1.3,
wbc_log10=.57,
age=0.044,
ecog24=.4)
x <- cbind(X[,colnames(X) %in% names(coef)], wbc_log10 = log10(100*1e3*pmax(X[,"wbc_100"], 0.001)), age = X[,"AOD_10"]*10, ecog24 = X[,"Performance_ECOG"]>=2)
risk <- as.matrix(x[,names(coef)]) %*% coef
group <- cut(risk, c(min(risk), 4,5.4, max(risk)), labels = c("low","int","high"))
return(data.frame(risk, group))
}
pinaOs <- PINAOs(dataFrame)
nkIdx <- clinicalData$NK == 1
plot(survfit(os[nkIdx] ~ pinaOs[nkIdx,2]), col=rev(set1[1:3]))
survConcordance(os[nkIdx] ~ pinaOs[nkIdx,1])
## Call:
## survConcordance(formula = os[nkIdx] ~ pinaOs[nkIdx, 1])
##
## n=694 (92 observations deleted due to missingness)
## Concordance= 0.6611361 se= 0.01526872
## concordant discordant tied.risk tied.time std(c-d)
## 123745.000 63425.000 1.000 78.000 5715.723
Compared to CPSS (AML data)
survConcordance(os[nkIdx] ~ predict(coxCPSSIntOs, newdata=dataFrame)[nkIdx])
## Call:
## survConcordance(formula = os[nkIdx] ~ predict(coxCPSSIntOs, newdata = dataFrame)[nkIdx])
##
## n=694 (92 observations deleted due to missingness)
## Concordance= 0.6889582 se= 0.01526872
## concordant discordant tied.risk tied.time std(c-d)
## 128953.000 58218.000 0.000 78.000 5715.723
And on TCGA data
tcgaPinaOs <- PINAOs(cbind(tcgaDataImputed, `NPM1:FLT3_ITD` = tcgaDataImputed[,"NPM1"]*tcgaDataImputed[,"FLT3_ITD"]))
tcgaNkIdx <- tcgaClinical$karyotype == "Normal"
survConcordance(tcgaSurvival[tcgaNkIdx] ~ tcgaPinaOs[tcgaNkIdx,1])
## Call:
## survConcordance(formula = tcgaSurvival[tcgaNkIdx] ~ tcgaPinaOs[tcgaNkIdx,
## 1])
##
## n=96 (6 observations deleted due to missingness)
## Concordance= 0.6335756 se= 0.04038263
## concordant discordant tied.risk tied.time std(c-d)
## 2179.0000 1260.0000 1.0000 29.0000 277.8325
survConcordance(tcgaSurvival[tcgaNkIdx] ~ tcgaRiskCPSSOs[tcgaNkIdx])
## Call:
## survConcordance(formula = tcgaSurvival[tcgaNkIdx] ~ tcgaRiskCPSSOs[tcgaNkIdx])
##
## n=96 (6 observations deleted due to missingness)
## Concordance= 0.6575581 se= 0.04038309
## concordant discordant tied.risk tied.time std(c-d)
## 2262.0000 1178.0000 0.0000 29.0000 277.8357
ELN score (Döhner et al. 2010)
ELN <- function(X, nkIdx){
factor(ifelse(X$inv3_t3_3==1 | X$t_6_9==1 | X$minus5_5q==1 | X$mono17_17p_abn17p==1 | X$minus7==1 | X$complex==1 | X$t_v_11==1,
"Adverse",
ifelse(X$t_15_17==1 | X$t_8_21==1 | X$inv16_t16_16==1 | ((X$CEBPA_bi==1 | X$CEBPA_mono==1 | (X$NPM1==1 & X$FLT3_ITD==0)) & nkIdx),
"Favorable",
ifelse(nkIdx & (X$FLT3_ITD==1 | X$NPM1==0 & X$FLT3_ITD==0),
"Inter-1", "Inter-2"))), levels=rev(c("Adverse","Inter-1","Inter-2","Favorable")))
}
table(clinicalData$M_Risk, ELN(dataFrame, nkIdx))
##
## Favorable Inter-2 Inter-1 Adverse
## Adverse 0 0 0 253
## Favorable 457 6 6 4
## Inter-1 17 0 400 0
## Inter-2 1 251 7 9
Other models
tcgaRisk <- data.frame(
#stdRisk = c(3,1,2)[tcgaClinical$C_Risk],
ELN = as.numeric(ELN(tcgaDataImputed, tcgaNkIdx)),
tree = predict(tree, newdata=tcgaDataImputed),
rForest = predict(rForest, newdata = tcgaDataImputed, importance="none")$predicted,
PINAos = tcgaPinaOs[,1],
coxRFX = tcgaRiskRFXOs[,1],
coxBIC = predict(coxBICOs, newdata=tcgaDataImputed),
coxAIC = predict(coxAICOs, newdata=tcgaDataImputed),
coxCPSS = tcgaRiskCPSSOs
)
Concordance of all models
tcgaConcordance <- sapply(tcgaRisk, function(x) {c <- survConcordance(tcgaSurvival ~ x); c(c$concordance, c$std.err)})
tcgaConcordance
## ELN tree rForest PINAos coxRFX coxBIC coxAIC coxCPSS
## concordant 0.57927774 0.64813236 0.68087415 0.66126782 0.70102294 0.69032858 0.68769374 0.69164600
## std(c-d) 0.02815496 0.02739215 0.02968944 0.02968936 0.02968944 0.02968944 0.02968944 0.02968944
o <- order(tcgaConcordance[1,])
barplot(tcgaConcordance[1,o], border=NA, col= set1[-6], las=2, xaxt="n", ylab="Concordance", ylim=c(0.5,0.75), xpd=FALSE) -> b
segments(b,tcgaConcordance[1,o]-tcgaConcordance[2,o],b,tcgaConcordance[1,o]+tcgaConcordance[2,o])
rotatedLabel(b, rep(0.49,length(b)), colnames(tcgaConcordance)[o], srt=45)
AUC of all models
library(survAUC)
library(survivalROC)
tcgaAUC <- sapply(tcgaRisk, function(x) AUC.uno(na.omit(os), tcgaSurvival[!is.na(x) & !is.na(tcgaSurvival)], scale(x)[!is.na(tcgaSurvival) &! is.na(x)], c(90,365,1000)/365)$auc)
tcgaAUCi <- sapply(tcgaRisk, function(x) AUC.uno(na.omit(os), tcgaSurvival[!is.na(x) & !is.na(tcgaSurvival)], scale(x)[!is.na(tcgaSurvival) &! is.na(x)], sort(na.omit(tcgaSurvival[,1])))$iauc)
o <- order(colMeans(tcgaAUC))
barplot(tcgaAUC[,o], border=1, col= rep(c("grey",set1[-6]),each=3), las=2, xaxt="n", ylab="AUC", beside=TRUE, density=c(NA, 48,24), ylim=c(0.5,0.85), xpd=FALSE) -> b
legend("topleft", bty="n", c("3mo","1yr","3yr"), fill='black', density=c(NA, 48,24))
rotatedLabel(b[seq(3, length(b), 3)], rep(0.49,length(tcgaRisk)), names(tcgaRisk)[o], srt=45)
KM curves for four risk categories (quartiles)
risk <- cut(tcgaRiskRFXOs[,1], quantile(tcgaRiskRFXOs[,1]), labels=c("1st Q","2nd Q","3rd Q","4th Q"))
s <- survfit(tcgaSurvival ~ risk)
plot(s, col=set1[c(3,2,4,1)], mark=NA, xlab="Years", ylab="Survival")
legend("topright", bty="n", rownames(summary(s)$table), col=set1[c(3,2,4,1)], lty=1)
Distribution of risk vs cytogenetic categories
risk <- tcgaRiskRFXOs[,1] - mean(tcgaRiskRFXOs[,1])
x <- seq(from=-4,to=4, l=512)
r <- sapply(levels(tcgaClinical$C_Risk)[c(2,3,1)], function(r){
i <- tcgaClinical$C_Risk==r
d <- density(na.omit(risk[i]), from=-4,to=4)$y * mean(i, na.rm=TRUE)
})
par(mar=c(4,4,3,4)+.1, bty="n")
plot(exp(x),rowSums(r), type='l', lty=0,xlab="Hazard", ylab="Prop. patients", log='x', ylim=c(0,.55))
for(i in 1:3)
polygon(exp(c(x, rev(x))), c(rowSums(r[,1:i, drop=FALSE]), rev(rowSums(cbind(0,r)[,1:i, drop=FALSE]))), col=set1[c(3,2,1)][i], border=NA)
H0 <- basehaz(coxph(tcgaSurvival ~ risk), centered=TRUE)
hazardDist <- splinefun(H0$time, H0$hazard, method="monoH.FC")
invHazardDist <- splinefun(H0$hazard, H0$time, method="monoH.FC")
l <- c(0.1,.5,.9)#c(0.1,0.25,.5,.75,.9)
for(i in seq_along(l))
lines(exp(x), pmax(0,invHazardDist(-log(l[i]) /exp(x) ))/10000*365, col='black', lty=c(2,1,2)[i])
axis(side=4, at=seq(0,.5,0.1), labels=seq(0,.5,.1)*10000*365)
mtext(side=4, "Time", line=2.5)
mtext(side=3, at = -log(l)/hazardDist(par("usr")[4]*10000*365), text=paste(100*l, "% survive", sep=""))
legend("topright", levels(tcgaClinical$C_Risk)[c(2,3,1)], fill=set1[c(3,2,1)], bty="n", title="M risk")
d <- tcgaData
d$transplantRel <- tcgaTpl[,"transplantRel"]
d$transplantCR1 <- tcgaTpl[,"transplantCR1"]
d$MissingCyto <- (tcgaClinical$karyotype == '[Not Available]' )+0
multiRfx5Tcga <- MultiRFX5(coxRFXNcdTD, coxRFXCrTD, coxRFXNrdTD, coxRFXRelTD, coxRFXPrdTD, d, tdPrmBaseline = tdPrmBaseline, tdOsBaseline = tdOsBaseline, x=xmax)
mRFX3yr <- colSums(multiRfx5Tcga[3*365,1:3,])
survConcordance(tcgaSurvival ~ mRFX3yr)
## Call:
## survConcordance(formula = tcgaSurvival ~ mRFX3yr)
##
## n=186 (14 observations deleted due to missingness)
## Concordance= 0.6878487 se= 0.02968944
## concordant discordant tied.risk tied.time std(c-d)
## 8876.0000 4028.0000 0.0000 162.0000 766.2249
TCGA concordance of time-dependent models
tcgaDataTdImputed <- as.data.frame(ImputeMissing(dataFrame[mainIdxOsTD], newX=tcgaData[mainIdxOsTD]))
tcgaRiskTD <- data.frame(
coxBICTD = predict(coxBICOsTD, newdata=tcgaDataTdImputed),
coxAICTD = predict(coxAICOsTD, newdata=tcgaDataTdImputed),
coxRFXTD = PredictRiskMissing(coxRFXFitOsTDGGc, tcgaData)[,1],
mRFX3yr = mRFX3yr
)
tcgaConcordanceTD <- sapply(tcgaRiskTD, function(x) unlist(survConcordance(tcgaSurvival ~ x)[c("concordance","std.err")]))
s <- rowMeans(colSums(aperm(multiRfx5Tcga[,1:3,],c(2,1,3))))
plot(survfit(tcgaSurvival ~ 1))
lines(seq(0,2000)/365.25/1.25,1-s)
multiRfx5TcgaC <- sapply(seq(1,2000,10), function(i) survConcordance(tcgaSurvival ~ colSums(multiRfx5Tcga[i,1:3,]))$concordance)
plot(seq(1,2000,10)/365.25,multiRfx5TcgaC, type='l', xlab="Years after diagnosis", ylab="Concordance", col=set1[1])
abline(h=tcgaConcordanceTD[1,"coxRFXTD"],col=set1[2])
legend("bottomright",c("RFX OS","RFX Multistage"), col=set1[2:1], lty=1, bty="n")
times <- seq(1,2000,10)
s <- summary(survfit(tcgaSurvival ~ 1), times=times/365.25)
c <- summary(survfit(coxRFXFitOsTDGGc), times=times)
e <- sapply(times, function(t) ape(s$surv[times==t], tcgaSurvival, t/365.25))
a <- sapply(times, function(t) ape(1-colSums(multiRfx5Tcga[t,1:3,]), tcgaSurvival, t/365.25))
b <- sapply(times, function(t) ape(c$surv[times==t] ^ exp(tcgaRiskTD$coxRFXTD - mean(coxRFXFitOsTDGGc$means %*% coxRFXFitOsTDGGc$coefficients)), tcgaSurvival, t/365.25))
for(i in 1:4){
plot(times/365.25, e[i,], type='l', xlab="Time (yr)", ylab=rownames(a)[i], col=set1[9])
lines(times/365.25, a[i,], col=set1[1])
lines(times/365.25, b[i,], col=set1[2])
legend("bottomright",c("Kaplan-Meier","Multistage","RFX"), col=set1[c(9,1:2)], lty=1, bty="n")
}
Here we generate the overview shown in Supplementary Figure S1A.
library(abind)
par(mar=c(3,3.5,.5,.5),bty="n", mgp=c(2.5,.5,0), las=2, lend=1, xpd=FALSE)
o <- c(1,7,2,3,4,6)
x <- rbind(allModelsCvC[o,], allModelsCvTdC[c("BIC","AIC","RFXgg","mRFX3yr"),])
col <- brewer.pal(4,"Set1")
#boxplot(t(x[o,]), notch=TRUE, ylab="Concordance", staplewex=0, lty=1, pch=16, xaxt="n", border="white", ylim=c(0.5,0.75), boxwex=.5)
bplot <- function(x, at=1:ncol(x),..., ylim=range(x), xlab="", col="black", col.lines="grey"){
y <- apply(x,2,fivenum)
plot(at,y[3,], pch=NA, ..., ylim=ylim, xlab="", xaxt="n")
segments(at,y[1,],at,y[5,], col=col.lines, lwd=2)
segments(at,y[2,],at,y[4,], col=col.lines, lwd=5)
points(at,y[3,], pch=15, col=col)
}
s <- .2 #space
a <- c(1:6, 7:10+.5)
bplot(t(x), at=a-1.5*s,ylab="Concordance", ylim=c(0.5,0.75), xlim=range(a)+c(-.5,.5))
abline(h=seq(.5,.75,.05), col="lightgrey")
par(xpd=NA)
t <- tcgaConcordance[,c(1,3,6,7,8,5)]
z <- abind(abind(TCGA=t, allModelsTrialC[,o,]), abind(TCGA=tcgaConcordanceTD, allModelsTrialTdC[,c("BIC","AIC","RFXgg","mRFX3yr"),]), along=2)
m <- sapply(1:ncol(z),function(i){
err <- 1 / sum(1/z[2,i,]^2)
avg <- sum(z[1,i,] /z[2,i,]^2) * err
c(avg,sqrt(err))})
#segments(1:6+s/2,m[1,]-m[2,],1:6+s/2,m[1,]+m[2,], lwd=2, col="#00000044")
#points(m[1,], pch=19, cex=1.5)
#segments(a-s/2,t[1,]-t[2,],a-s/2,t[1,]+t[2,], col=paste0(col[1],"FF"), lwd=2)
#points(a-s/2,t[1,], col=col[1], pch=16, cex=1)
i <- 0; for(n in dimnames(z)[[3]]) { i<-i+1;
segments(a -s +s/2*i, z[1,,n] - z[2,,n],a -s +s/2*i, z[1,,n]+ z[2,,n], col=mg14::colTrans(col[i]), lwd=2)
points(a -s +s/2*i, z[1,,n], col=col[i], pch=16, cex=1)
}
segments(a -3/4*s, m[1,],a+s*5/4,m[1,], lwd=3)
mg14::rotatedLabel(a, labels= rownames(x))
legend("bottomright",
c(
"random CV 4/5 x100",
paste0("TCGA, (n=",nrow(na.omit(tcgaSurvival)),")"),
paste0(dimnames(allModelsTrialC)[[3]]," (n=",table(clinicalData$Study),")"),
"average"),
lty=c(1,1), bg="white", col=c("grey",col[1:4], "black"), pch=c(15,16,16,16,16,NA))
Short version
par(mar=c(3,3.5,.5,.5),bty="n", mgp=c(2.5,.5,0), las=2, lend=1, xpd=FALSE)
r <- sapply(as.data.frame(lapply(as.data.frame(t(apply(-x,2,rank, ties.method="random"))),factor, levels=1:nrow(x))),table)
o <- 1:ncol(r) #order(apply(allModelsCvTdC[w,],1,median))
clr <- rev(brewer.pal(nrow(r),"PiYG"))#set1[c(3,2,4,1,5,7)]
barplot(r[,o]/replicates, col=clr[1:ncol(allModelsCvTdC)], ylab="Fraction", names.arg=rep("",ncol(r))) -> b
mg14::rotatedLabel(1:ncol(r), rep(par("usr")[3],ncol(x)), colnames(r)[o])
legend(par("usr")[1],1.5, fill=clr[1:nrow(r)], legend=1:nrow(r), bty="n", border=NA, horiz=TRUE, title="Rank")
We use simulations to assess different properties of our risk modelling approach.
Simulating survival times is useful, for example, to verify the consistency of our estimators and obtain empirical confidence intervals.
In the Cox proportional hazards model, the hazard is given by:
\[ \lambda(t) = \lambda_0(t) \exp(u Z) = -\frac{dS(t)}{dt}\frac{1}{S(t)}.\]
On the transformed time-scale \(\tau(t) = \int_0^t \lambda_0(t') dt'\), the hazard is constant and survival times are exponentially distributed. A strategy to simulate survival times according to the Cox proportional hazards model is therefore to draw transformed survival times \(\tilde T \sim \operatorname{Exp}(e^{u Z})\) and to map these back to the original time-scale via the inverse \(\tau^{-1}\).
The observed survival times \(T_o\) are subject to censoring. The generative process can be thought of as \(T_o = \min\{T, T_c\}\), where \(T_c\) is a censoring time and \(T\) the actual survival. This process may be simulated by estimating the cumulative distribution of censoring times \(\hat F(T_c)\) using the Kaplan-Meier estimator and subsequently simulating censoring times \(T_c = \hat F^{-1}(U); U \sim \operatorname{Unif}(0,1)\).
The simulated times and events are then \(T_o = \min\{T, T_c\}\) and the status is 1 if \(T < T_c\) and 0 otherwise.
Hence our algorithm to simulate survival times can be summarised as follows:
1. Estimate the cumulative baseline hazard L_0(t) and its inverse Linv_0(x).
2. Estimate the censoring distribution Fcens(t) and its inverse Finv.
3. Compute each patient's hazard h and transform using T = Linv_0(rexp(h)).
4. Draw censoring times T_c = Finv(runif()).
5. Return the observed times pmin(T_c, T), with status given by T < T_c.

This algorithm is implemented in CoxHD::SimSurvNonp().
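The simulation strategy can be sketched in a few lines; this is purely illustrative, with an exponential baseline hazard and uniform censoring standing in for the non-parametric estimates of L_0 and Fcens used by CoxHD::SimSurvNonp().

```python
import numpy as np

rng = np.random.default_rng(42)

def sim_surv(risk, base_rate=0.5, cens_max=4.0):
    """Simulate censored survival times under a Cox model.

    risk: linear predictor u*Z per patient. The baseline hazard is taken
    as exponential with rate base_rate, so L_0(t) = base_rate * t and
    Linv_0(x) = x / base_rate; censoring times are uniform, a stand-in
    for inverting a Kaplan-Meier estimate of the censoring distribution.
    """
    risk = np.asarray(risk, dtype=float)
    # Draw on the transformed time scale: exponential with rate exp(risk)
    tau = rng.exponential(scale=np.exp(-risk))
    T = tau / base_rate                               # Linv_0(tau)
    Tc = rng.uniform(0.0, cens_max, size=len(risk))   # Finv(runif())
    return np.minimum(T, Tc), (T < Tc).astype(int)    # time, status

time, status = sim_surv(rng.normal(size=2000))
```

As expected, patients with a higher linear predictor have systematically shorter observed times.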
We subsample patients and genes to assess the influence of cohort size and of the breadth of genomic sequencing on our predictive performance.
We take the following approach:
We observed that the key determinant of the variance of the genetic log hazard is the average number of (genetic) drivers per patient.
For a given size n, repeat r times:

1. Subsample n patients using sample().
2. Fit a CoxRFX model to the subsample and evaluate it on the held-out patients.

Compute the average concordance across all repetitions r. The number of repetitions r was chosen for each n such that the total test size r * (1540 - n) was approximately constant, in order to achieve a similar error of the average concordance.
Here we use a non-parametric approach to simulate data sets for larger cohorts and thereby extrapolate the influence of cohort size on prognostic accuracy. We also use a parametric approach to quantify the relation between the number of genes sequenced and model performance.
To extrapolate to larger cohort sizes we need to simulate new patients distributed according to the empirical distribution. We observed that simple resampling exaggerates the effect of interaction terms, as particular constellations become overrepresented. We therefore resampled patients and variables and used a multiple imputation package to impute the missing variables, noting that this is more likely to generate non-duplicate data points that still follow the empirical distribution of the original data.
So we used the following steps:

1. Resample patients and variables.
2. Impute the missing variables with mice (van Buuren and Groothuis-Oudshoorn 2011) using 10 chains.

This protocol is implemented as CoxHD::SimData().
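The resample-then-impute idea can be illustrated with a minimal sketch. This is a placeholder only: the actual protocol, CoxHD::SimData(), imputes the deleted entries with multiple imputation (mice, 10 chains), whereas plain column means are used here just to show the resample, mask, impute shape.

```python
import numpy as np

rng = np.random.default_rng(7)

def sim_cohort(X, n_new, mask_frac=0.5):
    """Bootstrap rows of X, blank a random subset of entries, refill them.

    The fill-in step below (column means) is a deliberate simplification;
    the paper's protocol uses chained-equations multiple imputation.
    """
    S = X[rng.integers(0, X.shape[0], size=n_new)].astype(float)
    mask = rng.random(S.shape) < mask_frac    # entries to regenerate
    S[mask] = np.nan
    col_mean = np.nanmean(S, axis=0)          # placeholder imputation
    return np.where(np.isnan(S), col_mean, S)

X = rng.binomial(1, 0.3, size=(100, 5))       # toy binary covariates
S = sim_cohort(X, 500)
```

Masking and re-imputing roughly half of the entries is what breaks up exact duplicates while keeping each simulated patient close to the empirical distribution.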
One observation made during the [subsampling of genes](#subsampling-of-genes) was that the predicted variation of risk was a linear function of the average number of drivers/patient. Here we derive the theoretical groundwork supporting this observation.
Let \(Z\) be the set of genetic predictors and \(u \sim N(\mu;\sigma)\) the distribution of effect sizes. Then the variation in log hazard is given by
\[ \begin{align} Var[h] = Var[u^T Z] &= E[Var[u^T Z|Z]] + Var[E[u^T Z | Z]] \cr &= E[ Z^T Var[u] Z ] + \mu^2 Var[\textstyle\sum_i Z_i] \cr &= \sigma^2 E[ \textstyle\sum_i Z^2_i] + \mu^2 Var[\textstyle\sum_i Z_i] \cr &= \sigma^2 E[ \textstyle\sum_i Z_i] + \mu^2 Var[\textstyle\sum_i Z_i] \qquad Z_i^2 = Z_i \in \{0,1\} \cr &= \sigma^2 E[D] + \mu^2 Var[D] \end{align} \]
Where \(D=\sum_i Z_i\) denotes the total number of drivers per patient. The latter term \(\mu^2 Var[D]\) can be ignored as long as \(D \approx 1\) and \(|\mu| < 1\). Hence the variation in the log hazard increases proportionally to the mean number of drivers.
Note: These derivations hold for an additive model.
For interaction terms, \[E[Var[Z^T B Z | Z]] = \sigma^2 E[\textstyle\sum_{i<j} Z_i Z_j] = \sigma^2 E[I],\] where \(I\) denotes the number of active interaction terms per patient; note that \(I \le D(D-1)/2\).
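The additive identity \(Var[h] = \sigma^2 E[D] + \mu^2 Var[D]\) can be checked by Monte-Carlo simulation; the values of \(\mu\), \(\sigma\) and the mutation probability below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100_000, 50            # patients, genetic predictors
q, mu, sigma = 0.05, 0.1, 0.4  # mutation prob., effect-size mean and sd

Z = rng.random((n, p)) < q               # binary driver indicators
u = rng.normal(mu, sigma, size=(n, p))   # effect sizes, one draw per sample
h = (u * Z).sum(axis=1)                  # genetic log hazard u^T Z
D = Z.sum(axis=1)                        # number of drivers per patient

predicted = sigma**2 * D.mean() + mu**2 * D.var()
empirical = h.var()
```

Drawing a fresh u for every sample is simply a way of sampling from the joint distribution of (u, Z); the decomposition only requires u to be independent of Z.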
On TCGA data we can estimate the number of drivers by means of the sum of the excess of non-synonymous over synonymous mutations at each gene (Martincorena and others 2015). We use the total number of indels as an upper bound for the number of driver indels.
Using this approach we detect an average of 2.3 driver point mutations and 1.4 driver indels, adding up to 3.7 drivers per AML case when considering the entire exome. This compares with an average of 1.55 driver substitutions and 0.94 driver indels, a total of 2.3 mutations (excluding multiple mutations in the same gene), in our cohort, having sequenced the 111 most prevalent driver genes.
It therefore appears that exome-wide sequencing could increase the variance explained by the genes considered by approximately 50%.
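One plausible reading of the roughly 50% figure, assuming (as derived above) that the explained variance scales with the mean number of drivers E[D]:

```python
exome = 2.3 + 1.4        # driver substitutions + indels, whole exome
panel = 1.55 + 0.94      # drivers captured by the 111-gene panel
gain = exome / panel - 1 # fractional increase in E[D]
print(round(gain, 2))    # about 0.49, i.e. roughly 50%
```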
Subsampling patients
library(survivalROC)
set.seed(42)
subsets <- seq(100,1500,100)
subsetPatients <- lapply(subsets, function(s){
mclapply(1:ceiling(50000/(1540-s)), function(foo){
set.seed(s*foo)
trn <- 1:nrow(dataFrame) %in% sample(nrow(dataFrame), s)
tst <- !trn
fit <- CoxRFX(dataFrameOsTD[tplSplitOs[trn], whichRFXOsTDGG], osTD[tplSplitOs[trn]], groups[whichRFXOsTDGG], which.mu=mainGroups, nu = 0.1)
C <- survConcordance(osTD[tplSplitOs[tst]]~predict(fit, newdata=dataFrameOsTD[tplSplitOs[tst], whichRFXOsTDGG]))
ROC <- survivalROC(Stime=os[!is.na(os) & tst,1], status=os[!is.na(os) & tst,2], marker = predict(fit, newdata=dataFrame[tst, whichRFXOsTDGG]), predict.time = 850, method="KM", cut.values=seq(-5,5,0.1))
list(C, ROC, trn, tst, coef(fit))}, mc.cores=10)
})
#pdf("subsetConcordance.pdf", 2.5,2.5, pointsize=8)
col1 <- colorRampPalette(set1[c(3,2,4,1,5)])(length(subsets))
plot(NA,NA, xlim=c(0,1),ylim=c(0,1), xlab="FPR",ylab="TPR")
abline(0,1, lty=3)
for(i in seq_along(subsets)){
x <- sapply(subsetPatients[[i]], function(x) x[[2]]$FP)
y <- sapply(subsetPatients[[i]], function(x) x[[2]]$TP)
lines(rowMeans(x),rowMeans(y), col=col1[i], type="l")
}
#legend("bottomright", legend=rev(subsets), lty=1, col=col1[5:1], bty="n")
rangeplot2 <- function(x, y, col = 1, pch = 19, lty = 1, ylim=range(unlist(y)),...){
plot(x, sapply(y, mean), col = col, pch=pch, ylim = ylim, ..., xaxt="n")
points(jitter(unlist(sapply(seq_along(y), function(i) rep(x[i], length(y[[i]]))))),unlist(y),pch=1, col=unlist(sapply(seq_along(y), function(i) rep(col[i], length(y[[i]])))), cex=.2)
lines(x, sapply(y, mean), lwd=2)
lines(x, sapply(y, mean) + 2*sapply(y, sd)/sqrt(sapply(y,length)))
lines(x, sapply(y, mean) - 2*sapply(y, sd)/sqrt(sapply(y,length)))
axis(at = x, labels=x, side=1)
#segments(x,apply(y,2,min),x,apply(y,2,max), col=col, lty = lty)
}
rangeplot3 <- function(x, y, col = 1, pch = 19, lty = 1, ylim=range(unlist(y)),...){
plot(x, sapply(y, mean), col = col, pch=pch, ylim = ylim, ...)
#points(jitter(unlist(sapply(seq_along(y), function(i) rep(x[i], length(y[[i]]))))),unlist(y),pch=1, col=unlist(sapply(seq_along(y), function(i) rep(col[i], length(y[[i]])))), cex=.2)
#lines(x, sapply(y, mean), lwd=2)
s <- sapply(y, sd)/sqrt(sapply(y,length))
m <- sapply(y, mean)
segments(x, m+s*2, x, m-s*2, col=col)
#axis(at = x, labels=x, side=1)
#segments(x,apply(y,2,min),x,apply(y,2,max), col=col, lty = lty)
}
rangeplot2(x=subsets, y = sapply(subsetPatients, function(x) sapply(x, function(y) y[[2]]$AUC)) , col=col1, xlab="Cohort", ylab="AUC", lty=1, ylim=c(0.7,0.85))
rangeplot2(x=subsets, y = sapply(subsetPatients, function(x) sapply(x, function(y) y[[1]]$concordance)) , col=col1, xlab="Cohort", ylab="Concordance", lty=1, ylim=c(0.65,.75), log='')
rangeplot3(x=subsets, y = sapply(subsetPatients, function(x) sapply(x, function(y) y[[1]]$concordance)) , col=col1, xlab="Cohort", ylab="Concordance", lty=1, ylim=c(0.67,.73), log='')
#lines(x=subsets, y = concordanceFromVariance(sapply(subsetPatients, function(x) {
# mean(sapply(x, function(y) {
# h <- var(as.matrix(dataFrameOsTD[tplSplitOs[y[[3]]],whichRFXOsTDGG]) %*% y[[5]])
# }))
# })) , col=1, xlab="Cohort", ylab="Concordance", ylim=c(0.65,.75))
#
Subsampling genes
set.seed(42)
subsets <- seq(5,55,5)
genes <- names(whichRFXOsTDGG[groups=="Genetics"])
subsetGenes <- lapply(subsets, function(s){
	mclapply(1:100, function(foo){
		g <- sample(genes, s)
		ix <- !grepl(paste(g,collapse="|"), names(whichRFXOsTDGG))
		trainIdx <- sample(1:nrow(dataFrame)%%5 +1 )!=1 ## hold out 1/5 for testing
		testIdx <- !trainIdx
		fit <- CoxRFX(dataFrameOsTD[tplSplitOs[trainIdx], whichRFXOsTDGG[ix]], osTD[tplSplitOs[trainIdx]], groups[whichRFXOsTDGG[ix]], which.mu=mainGroups, nu = 0.1)
		C <- survConcordance(osTD[tplSplitOs[testIdx]]~predict(fit, newdata=dataFrameOsTD[tplSplitOs[testIdx], whichRFXOsTDGG[ix]]))
		ROC <- survivalROC(Stime=os[!is.na(os) & testIdx,1], status=os[!is.na(os) & testIdx,2], marker = predict(fit, newdata=dataFrame[testIdx, whichRFXOsTDGG[ix]]), predict.time = 850, method="KM", cut.values=seq(-5,5,0.1))
		fit <- CoxRFX(dataFrameOsTD[, whichRFXOsTDGG[ix]], osTD, groups[whichRFXOsTDGG[ix]], which.mu=mainGroups, nu = 0.1)
		S <- cov(PartialRisk(fit))
		list(C, ROC, S, trainIdx, testIdx, ix, mean(rowMeans(dataFrame[setdiff(genes,g)])))
	}, mc.cores=10)
})
plot(sapply(subsetGenes, function(x) sapply(x, function(y) y[[7]]*sum(y[[6]][1:58]))), sapply(subsetGenes, function(x) sapply(x, function(y) {t <- try(sum(y[[3]][c("Genetics","GeneGene"),c("Genetics","GeneGene")])); ifelse(class(t)=="try-error",NA,t)})), xlab="Mean no. of drivers", ylab=expression(paste(Var,"[",h[g],"]")), xlim=c(0,3.8), ylim=c(0,.35), pch=16, col=c("#00000044"))
x <- c(0,3.7)
s <- coxRFXFitOsTDGGc$sigma2["Genetics"]
segments(c(2.3, 3.7), rep(par("usr")[3],2), c(2.3, 3.7), c(2.3, 3.7) * s, col="grey")
segments( rep(par("usr")[1],2), c(2.3, 3.7) * s, c(2.3, 3.7), c(2.3, 3.7) * s, col="grey")
lines(x, x*s, col="red")
par(xpd=NA)
axis(at=c(2.3, 3.7), labels=c("111 genes", "TCGA (exome)"), tcl=0.5, side=1, mgp=c(-2.5,-2,0))
Simulate data using multiple imputation.
set.seed(42)
SimDataNonp
## function (oldData, nData, percentMissing = 0.33, ...)
## {
## require(mice)
## oldData <- oldData[!apply(is.na(oldData), 1, all), ]
## for (i in 1:nrow(oldData)) {
## while (TRUE) {
## naIdx <- sample(ncol(oldData), round(percentMissing *
## ncol(oldData)))
## if (!all(1:ncol(oldData) %in% naIdx))
## break
## }
## oldData[i, naIdx] <- NA
## }
## m <- mice(as.data.frame(oldData), printFlag = FALSE, ...)
## newData <- complete(m, action = "long")
## newData <- newData[sample(nrow(newData), nData, replace = nrow(newData) <
## nData), -2:-1]
## return(newData)
## }
## <environment: namespace:CoxHD>
d <- as.matrix(dataFrame[mainIdxOsTD])
w <- groups[mainIdxOsTD] %in% c("Genetics","Fusions","CNA")
d[,w][! as.matrix(d[,w]) %in% c(0,1)] <- NA # set values other than 0/1 (previously imputed) back to NA
simData <- SimDataNonp(d, nData = 10000, m=10)
## Loading required package: mice
## mice 2.22 2014-06-10
names(simData) <- names(dataFrame[mainIdxOsTD])
Merge into data.frame
set.seed(42)
g <- groups[mainIdxOsTD]
for(w in which(colSums(simData,na.rm=TRUE) == 0))
simData[[w]] <- rbinom(nrow(simData),1,mean(dataFrame[mainIdxOsTD][,w]))
all(colSums(simData,na.rm=TRUE) != 0)
## [1] TRUE
simDataFrame <- cbind(simData,
MakeInteractions(simData[,g=="Genetics"], simData[,g=="Genetics"])[,as.vector(upper.tri(matrix(0,ncol=sum(g=="Genetics"), nrow=sum(g=="Genetics"))))])
for(n in unique(which(is.na(simDataFrame), arr.ind = TRUE)[,2]))
simDataFrame[[n]] <- poorMansImpute(simDataFrame[[n]])
simDataFrame <- StandardizeMagnitude(simDataFrame)
simDataFrame <- simDataFrame[,colnames(simDataFrame) %in% names(whichRFXOsTDGG) | colSums(simDataFrame)>=8]
simDataFrame$`NPM1:FLT3_ITD:DNMT3A` <- simDataFrame$NPM1 * simDataFrame$FLT3_ITD * simDataFrame$DNMT3A
dim(simDataFrame)
## [1] 10000 750
set.seed(42)
simGroups <- factor(c(as.character(g), rep("GeneGene", ncol(simDataFrame)-length(g))))
names(simGroups) <- colnames(simDataFrame)
simCoef <- CoxHD:::SimCoef(coxRFXFitOsTDGGc, groups = simGroups)
simRisk <- as.matrix(simDataFrame[names(whichRFXOsTDGG)]) %*% simCoef[names(whichRFXOsTDGG)]
simSurv <- SimSurvNonp(simRisk, os)
survConcordance(simSurv ~ simRisk)
## Call:
## survConcordance(formula = simSurv ~ simRisk)
##
## n= 10000
## Concordance= 0.7155292 se= 0.003360213
## concordant discordant tied.risk tied.time std(c-d)
## 35433824.0 14087292.0 100.0 0.0 332803.7
save(coxRFXFitOsTDGGc, whichRFXOsTDGG, simDataFrame, simGroups, os, mainGroups, file="sim2Data.RData")
The following code is run on the farm
read_chunk('../code/Farmulations2.R', labels="farmulationsCode")
load("sim2Data.RData")
library(mg14)
library(CoxHD)
library(parallel)
nData <- c(100, 200, 500, 1000, 2000, 5000, 10000)
nJobs <- as.numeric(Sys.getenv("NJOBS"))
jobIndex <- as.numeric(Sys.getenv("LSB_JOBINDEX"))
set.seed(jobIndex)
simCoef <- CoxHD:::SimCoef(coxRFXFitOsTDGGc, groups = simGroups)
simRisk <- as.matrix(simDataFrame[names(whichRFXOsTDGG)]) %*% simCoef[names(whichRFXOsTDGG)]
simRisk <- simRisk - (colMeans(simDataFrame[names(whichRFXOsTDGG)]) %*% simCoef[names(whichRFXOsTDGG)])[1]
simSurv <- SimSurvNonp(simRisk, os)
for(n in nData){
s <- if(n < 10000) sample(1:nrow(simDataFrame), n) else 1:10000
f <- CoxRFX(simDataFrame[s,names(whichRFXOsTDGG)], simSurv[s], simGroups[names(whichRFXOsTDGG)], nu=1,which.mu=mainGroups)
f$X <- NULL
assign(paste0("w",n), s)
assign(paste0("fit",n),f)
}
rm(simDataFrame)
save.image(file=paste("simRFX/",Sys.getenv("LSB_JOBNAME"),"_",Sys.getenv("LSB_JOBINDEX"),".RData", sep=""))
Read files
files <- dir("../code/simRFX", pattern="Farmulations\\[1-1000\\]*", full.names = TRUE)
tmp <- new.env()
load(files[1], envir = tmp)
Plot the P-values as a function of \(\psi N p \beta^2\).
w <- groups[whichRFXOsTDGG] %in% c("Genetics","Fusions","CNA", "GeneGene") ## Which groups
psi <- mean(os[,2]) ## Fraction of uncensored observations
plot(colSums(simDataFrame[names(whichRFXOsTDGG[w])]) * tmp$simCoef[whichRFXOsTDGG[w]]^2 , CoxHD:::WaldTest( tmp$fit10000)$p[w], log="yx", pch=NA, xlab=expression(psi *N *p *beta^2), ylab="P-value", ylim=c(1e-50,1))
for(f in files[1:50]){
load(f, envir = tmp)
points(psi*colSums(simDataFrame[names(whichRFXOsTDGG[w])]) * tmp$simCoef[names(whichRFXOsTDGG[w])]^2 , CoxHD:::WaldTest( tmp$fit10000)$p[w], col=colGroups[as.character(groups)[whichRFXOsTDGG[w]]], pch=1, cex=.5)
points(psi*colSums(simDataFrame[tmp$w1000, names(whichRFXOsTDGG[w])]) * tmp$simCoef[names(whichRFXOsTDGG[w])]^2 , CoxHD:::WaldTest( tmp$fit1000)$p[w], col=colGroups[as.character(groups)[whichRFXOsTDGG[w]]], pch=2, cex=.5)
if(tmp$fit100$iter[1] < 50) ## Exclude simulations without convergence
points(psi*colSums(simDataFrame[tmp$w100, names(whichRFXOsTDGG[w])]) * tmp$simCoef[names(whichRFXOsTDGG[w])]^2 ,CoxHD:::WaldTest( tmp$fit100)$p[w], col=colGroups[as.character(groups)[whichRFXOsTDGG[w]]], pch=3, cex=.5)
}
legend("bottomleft", lty=c(0,1),pch=c(1,NA), c("Simulations","Schoenfeld"), bty="n")
x <- 10^seq(-4,4,0.1)
lines(x, pnorm(sqrt(x), lower.tail = FALSE))
The theoretical power according to Schoenfeld/Schmoor is given by (Schmoor, Sauerbrei, and Schumacher 2000):
power <- function(beta, N, p, psi=0.5, alpha=0.05){
pnorm(sqrt(N*psi*beta^2*p*(1-p))-qnorm(1-alpha/2))
}
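For orientation, the formula can be evaluated at typical values; for instance, a hazard ratio of 2 (\(\beta = \log 2\)) for a variable mutated in 5% of patients is detected with near-certain power at N = 1,540, but is likely missed in a 100-patient cohort (function restated so the snippet is self-contained):

```r
## Schoenfeld/Schmoor power formula, restated from above
power <- function(beta, N, p, psi = 0.5, alpha = 0.05){
  pnorm(sqrt(N * psi * beta^2 * p * (1 - p)) - qnorm(1 - alpha / 2))
}
## power for a hazard ratio of 2 at a 5% mutation frequency
power(beta = log(2), N = 1540, p = 0.05)  # ~0.99: ample power
power(beta = log(2), N = 100,  p = 0.05)  # ~0.19: likely missed
```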
Plot the power for the observed cases and overlay a few usual suspects
x <- seq(-2,2,0.01)
y <- 10^seq(-4,0,0.01)
colLevels <- colorRampPalette(brewer.pal(9, "Reds")[-(1:2)])(11)
g <- c("Fusions","CNA","Genetics","GeneGene")
xObs <- matrix(exp(rep(coxRFXFitOsTDGGc$mu[g], each=2) + c(-1,1) * rep(sqrt(coxRFXFitOsTDGGc$sigma2[g]),each=2)), nrow=2) ## Mean log haz +/- sd
yObsQ <- sapply(split(colMeans(dataFrameOsTD[whichRFXOsTDGG]), groups[whichRFXOsTDGG]),quantile, c(0.05,0.5,0.95))[,g] ## 5,50,95% frequency quantiles
contour(outer(x,y,function(x,y) power(x,1540,y)), x=exp(x),y=y, log='xy', xlab="Hazard ratio", ylab="Mutation frequency", main="N=1540", col=colLevels)
## Warning in axis(side = side, at = at, labels = labels, ...): "log" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "log" is not a graphical parameter
## Warning in box(...): "log" is not a graphical parameter
rect(xObs[1,],yObsQ[1,],xObs[2,],yObsQ[3,], border = colGroups[c("Fusions","CNA","Genetics","GeneGene")])
#segments(exp(coxRFXFitOsTDGGc$mu[g]),yObsQ[1,],exp(coxRFXFitOsTDGGc$mu[g]),yObsQ[3,], col = colGroups[g])
#segments(xObs[1,],yObsQ[2,],xObs[2,],yObsQ[2,], col = colGroups[g])
effects <- c("NPM1","TP53","inv3_t3_3","t_15_17","inv16_t16_16","CEBPA_bi","FLT3_ITD","complex","NPM1:FLT3_ITD:DNMT3A") ## A few interesting variables
points(exp(coef(coxRFXFitOsTDGGc)[effects]), colMeans(dataFrame[effects]), col=colGroups[as.character(groups[effects])], pch=19)
text(labels=effects,exp(coef(coxRFXFitOsTDGGc)[effects]), colMeans(dataFrame[effects]), pos=ifelse(sign(coef(coxRFXFitOsTDGGc)[effects])==1,4,2))
legend("bottom", lty=c(1,NA,NA,NA,NA,NA),pch=c(NA,19,22,22,22,22), c("Power","Selected variables", paste("Dist.", g)), col=c(colLevels[10], "black", colGroups[g]), bty="n", ncol=2)
Compared to other cohort sizes
for(N in c(100,1000,10000)){
contour(outer(x,y,function(x,y) power(x,N,y)), x=exp(x),y=y, log='xy', xlab="Hazard ratio", ylab="Mutation frequency", main=paste("N=",N,sep=""), col=colLevels, drawlabels=FALSE)
rect(xObs[1,],yObsQ[1,],xObs[2,],yObsQ[3,], border = colGroups[g])
}
## Warning in axis(side = side, at = at, labels = labels, ...): "log" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "log" is not a graphical parameter
## Warning in box(...): "log" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "log" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "log" is not a graphical parameter
## Warning in box(...): "log" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "log" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "log" is not a graphical parameter
## Warning in box(...): "log" is not a graphical parameter
C <- sapply(files[1:500], function(f){
load(f)
r <- c(sapply(tmp$nData, function(n){
survConcordance(SimSurvNonp(simRisk[get(paste0('w',n))], os)~get(paste0("fit",n))$linear.predictors)$concordance
}),survConcordance(SimSurvNonp(simRisk, os)~simRisk)$concordance)
names(r) <- c(nData,"Truth")
return(r)
})
boxplot(t(C), staplewex=0, pch=16, lty=1, ylab="Concordance", xaxt="n")
rotatedLabel(labels=(sub(".concordant","", rownames(C))))
abline(h=CoxHD:::ConcordanceFromVariance(var(simRisk)))
load("../code/sim2Data.RData")
R <- sapply(files[1:100], function(f){
load(f, envir=.GlobalEnv)
r <- c(sapply(tmp$nData, function(n){
f <- get(paste0("fit",n))
assign("s", get(paste0("w",n)), envir=.GlobalEnv)
x <- as.matrix(simDataFrame[s, names(coef(f))])
h <- x %*% coef(f)
#z <- t(t(x)-colMeans(x))
#e <- rowSums(z %*% f$var2 * z)
#return(mean(e))
S <- try(survfit(f, newdata = as.data.frame(t(colMeans(x)))))
if(class(S)[1]=="try-error") return(NA)
#hazardDist <- splinefun(H0$time, H0$hazard, method="monoH.FC") ## unused here; H0 is not defined in this chunk
sapply(seq(0,365*5, by=365/4), function(w){
w <- which.min(abs(S$time-w))
p <- S$surv[w]^exp(h - mean(h))
q <- S$surv[w]^exp(simRisk[s])
abs(p-q)})
}))
names(r) <- c(nData)
return(r)
})
q <- sapply(1:nrow(R),function(i) apply(Reduce("rbind", R[i,]),2,quantile, c(0.025,0.25,0.5,0.75,0.975), na.rm=TRUE), simplify="array")
contour(q[3,,],x=seq(0,365*5, by=365/4)/365, y=nData , log='y', xlab="Time (years)", ylab="Cohort size", las=1, xlim=c(0,3))
## Warning in axis(side = side, at = at, labels = labels, ...): "log" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "log" is not a graphical parameter
## Warning in box(...): "log" is not a graphical parameter
plot(nData,q[3,5,], log='xy', ylim=c(1e-2,1), type='l', ylab="Prediction error", lwd=2, xlab="Cohort size")
polygon(c(nData,rev(nData)), c(q[1,5,], rev(q[5,5,])), border=NA, col="#88888844")
polygon(c(nData,rev(nData)), c(q[2,5,], rev(q[4,5,])), border=NA, col="#88888844")
#rotatedLabel(labels=(sub(".concordant","", rownames(q))))
par(mar=c(3,3,1,1), bty='n', mgp=c(2,0.5,0))
cohort <- function(beta, p, psi=0.5, alpha=0.05, power=0.5){
(qnorm(1-alpha/2) + qnorm(1-power) )^2 / (beta^2 * psi * p * (1-p))
}
x <- seq(-2,2, 0.01)
y <- 10^seq(-3,0, 0.01)
contour(outer(x,y,function(x,y) cohort(x,y, alpha=0.05/100)), x=exp(x),y=y, log='xy', xlab="Hazard ratio", ylab="Mutation frequency", col=colLevels, levels=c(10,20,50,100,200,500,1000,2000,5000,10000,20000))
## Warning in axis(side = side, at = at, labels = labels, ...): "log" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "log" is not a graphical parameter
## Warning in box(...): "log" is not a graphical parameter
rect(xObs[1,],yObsQ[1,],xObs[2,],yObsQ[3,], border = colGroups[c("Fusions","CNA","Genetics","GeneGene")])
effects <- c("NPM1","TP53","inv3_t3_3","t_15_17","inv16_t16_16","CEBPA_bi","FLT3_ITD","complex","NPM1:FLT3_ITD:DNMT3A") ## A few interesting variables
points(exp(coef(coxRFXFitOsTDGGc)[effects]), colMeans(dataFrame[effects]), col=colGroups[as.character(groups[effects])], pch=19)
text(labels=effects,exp(coef(coxRFXFitOsTDGGc)[effects]), colMeans(dataFrame[effects]), pos=ifelse(sign(coef(coxRFXFitOsTDGGc)[effects])==1,4,2))
#legend("bottom", lty=c(1,NA,NA,NA,NA,NA),pch=c(NA,19,22,22,22,22), c("Power","Selected variables", paste("Dist.", g)), col=c(colLevels[10], "black", colGroups[g]), bty="n", ncol=2)
Number of cases needed
par(mar=c(3,5,1,1), bty='n', mgp=c(2.5,0.5,0))
cases <- function(beta, alpha=0.05, power=0.5, p = 1e-2, psi=0.5){
(qnorm(1-alpha/2) + qnorm(1-power) )^2 / (beta^2 * (1-p) * psi)
}
x <- seq(-1,1,0.01)
x0 <- log(c(0.01,0.02,0.05,0.1,0.2,0.5,1)+1)
plot(exp(x), cases(x, alpha=5e-2), log='yx', type='l', xlab="Hazard ratio", ylab="Minimal number of cases", las=1)
#lines(exp(x), cases(x, alpha=1e-2), type='l', lty=2)
lines(exp(x), cases(x, alpha=1e-3), type='l', lty=3)
segments(exp(x0), par("usr")[3],exp(x0),cases(x0, alpha=5e-2), col='grey')
segments(exp(x[1]), cases(x0, alpha=5e-2),exp(x0),cases(x0, alpha=5e-2), col='grey')
axis(side=2, at=cases(x0, alpha=5e-2), labels=exp(x0), tcl=.5, line=0, las=2, mgp=c(-2.5,-.5,0), hadj=0)
axis(side=1, at=c(seq(0.1,3,0.1)), labels=rep("",30), tcl=-.2, line=0, las=2)
axis(side=2, at=rep(c(1:10), 4) * 10^rep(1:4, each=10), labels=rep("",40), tcl=-.2, line=0, las=2)
legend("topright", legend=c("P < 0.05 *","P < 0.001 ***"), lty=c(1,3), bty="n")
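The `cases()` formula can also be read off numerically: to detect a hazard ratio of 2 with the default 50% power, on the order of 17 mutated cases suffice at P < 0.05, but roughly three times as many are needed at the more stringent P < 0.001 (function restated so the snippet is self-contained):

```r
## Minimal number of mutated cases (Schmoor et al.), restated from above
cases <- function(beta, alpha = 0.05, power = 0.5, p = 1e-2, psi = 0.5){
  (qnorm(1 - alpha / 2) + qnorm(1 - power))^2 / (beta^2 * (1 - p) * psi)
}
ceiling(cases(log(2), alpha = 0.05))  # 17 mutated cases at P < 0.05
ceiling(cases(log(2), alpha = 1e-3))  # 46 mutated cases at P < 0.001
```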
The following function simulates data from the 5-stage multistage RFX model
SimSurv5 <- function(coxRFXNcdTD, coxRFXCrTD, coxRFXNrdTD, coxRFXRelTD, coxRFXPrdTD, data, coxphOs, coxphPrs, censInd, censCr, censRel){
	## Step 1: Compute KM survival curves and log hazard
	getS <- function(coxRFX, data, max.x=5000) {
		if(!is.null(coxRFX$na.action)) coxRFX$Z <- coxRFX$Z[-coxRFX$na.action,]
		data <- as.matrix(data[,match(colnames(coxRFX$Z),colnames(data))])
		r <- PredictRiskMissing(coxRFX, data, var="var2")
		H0 <- basehaz(coxRFX, centered = FALSE)
		hazardDist <- splinefun(H0$time, H0$hazard, method="monoH.FC")
		invHazardDist <- splinefun(c(0,H0$hazard), c(0,H0$time), method="monoH.FC")
		x <- c(0:ceiling(max.x))
		S <- exp(-hazardDist(x))
		return(list(S=S, r=r, x=x, hazardDist=hazardDist, invHazardDist=invHazardDist, r0 = coxRFX$means %*% coef(coxRFX)))
	}
	x <- 15000
	kmCr <- getS(coxRFX = coxRFXCrTD, data = data, max.x=max(x))
	kmEs <- getS(coxRFX = coxRFXNcdTD, data = data, max.x=max(x))
	kmCir <- getS(coxRFX = coxRFXRelTD, data = data, max.x=max(x))
	kmNrm <- getS(coxRFX = coxRFXNrdTD, data = data, max.x=max(x))
	kmPrs <- getS(coxRFX = coxRFXPrdTD, data = data, max.x=max(x))
	getCens <- function(surv, n){
		F <- survfit(surv~1)
		FCensInv <- splinefun(F$surv, F$time)
		censTimes <- FCensInv(runif(n,0,1)) ## Simulate censoring times
	}
	censIndTimes <- getCens(censInd, nrow(data))
	censCrTimes <- getCens(censCr, nrow(data))
	censRelTimes <- getCens(censRel, nrow(data))
	as.data.frame(t(sapply(1:nrow(data), function(i){
		crTime <- edTime <- relTime <- nrdTime <- prdTime <- NA
		status <- 1
		crTime <- kmCr$invHazardDist(rexp(1, exp(kmCr$r[i,1])))
		edTime <- kmEs$invHazardDist(rexp(1, exp(kmEs$r[i,1])))
		firstTime <- pmin(edTime, crTime, censIndTimes[i])
		if(firstTime==censIndTimes[i]){
			edTime <- firstTime
			status <- 0
			crTime <- NA
		}
		if(firstTime==edTime){
			crTime <- NA
		}else{
			edTime <- NA
			rInd <- predict(coxphOs, newdata=data.frame(time0=crTime))
			relTime <- kmCir$invHazardDist(rexp(1, exp(kmCir$r[i,1] + rInd)))
			nrdTime <- kmNrm$invHazardDist(rexp(1, exp(kmNrm$r[i,1] + rInd)))
			secondTime <- pmin(relTime, nrdTime, censCrTimes[i])
			if(secondTime==censCrTimes[i]){
				nrdTime <- secondTime
				relTime <- NA
				status <- 0
			}
			if(secondTime==nrdTime){
				relTime <- NA
			}else{
				nrdTime <- NA
				rCr <- predict(coxphPrs, newdata=data.frame(time0=relTime))
				prdTime <- kmPrs$invHazardDist(rexp(1, exp(kmPrs$r[i,1] + rCr)))
				if(prdTime > censRelTimes[i]){
					prdTime <- min(prdTime, censRelTimes[i])
					status <- 0
				}
			}
		}
		times <- c(crTime=crTime, edTime=edTime, relTime=relTime+crTime, nrdTime=nrdTime+crTime, prdTime=prdTime+crTime+relTime, status=status)
		return(times)
	}, simplify='array')))
}
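SimSurv5 relies twice on inverse-transform sampling: censoring times are drawn by inverting the Kaplan-Meier curve of the censoring distribution, and event times by passing an exponential draw with rate `exp(risk)` through the inverse cumulative baseline hazard. The principle can be sketched in isolation; here a constant baseline hazard is chosen purely for illustration, so that the inverse cumulative hazard is known in closed form:

```r
## Inverse-transform sampling as used in SimSurv5: if E ~ Exp(exp(r)), then
## T = H0^{-1}(E) has hazard exp(r) * h0(t) (proportional hazards).
set.seed(1)
lambda0 <- 0.01                             # illustrative constant baseline hazard
invHazardDist <- function(H) H / lambda0    # closed-form inverse of H0(t) = lambda0 * t
r <- log(2)                                 # log relative risk of this patient
T <- invHazardDist(rexp(1e5, rate = exp(r)))  # simulated event times
mean(T)  # should approach 1 / (lambda0 * exp(r)) = 50
```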
First prepare the data. Allograft indices:
alloIdx <- clinicalData$TPL_type %in% c("ALLO","FREMD") # only allografts
alloTimeRel <- clinicalData$TPL_date - clinicalData$Recurrence_date + .5 # +.5 to make > 0
alloTimeRel[!alloIdx | (clinicalData$TPL_date < clinicalData$Recurrence_date & !clinicalData$TPL_Phase %in% c("CR1","RD"))] <- NA
Spline-fitted transition probabilities.
coxphPrs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(prdData, time0=as.numeric(clinicalData$Recurrence_date-clinicalData$CR_date)[prdData$index]))
coxphOs <- coxph(Surv(time1,time2, status)~ pspline(time0, df=10), data=data.frame(osData, time0=pmin(500,cr[osData$index,1])))
Censoring distributions
censInd <- Surv(clinicalData$OS, 1-clinicalData$Status)[is.na(clinicalData$CR_date)]
censCr <- Surv(as.numeric(clinicalData$Date_LF - clinicalData$CR_date), 1-clinicalData$Status)[!is.na(clinicalData$CR_date) & is.na(clinicalData$Recurrence_date)]
censRel <- Surv(as.numeric(clinicalData$Date_LF - clinicalData$Recurrence_date), 1-clinicalData$Status)[!is.na(clinicalData$CR_date) & !is.na(clinicalData$Recurrence_date)]
Simulate outcomes
set.seed(42)
simSurv5 <- SimSurv5(coxRFXNcdTD, coxRFXCrTD, coxRFXNrdTD, coxRFXRelTD, coxRFXPrdTD, data, coxphOs, coxphPrs, censInd, censCr, censRel)
plot(survfit(Surv(apply(simSurv5[,1:5],1,max,na.rm=TRUE), simSurv5$status) ~ 1), xlim=c(0,5000))
lines(survfit(Surv(clinicalData$OS, clinicalData$Status) ~ 1), col='red')
Now re-estimate the models in the scenario of a 10,000-patient cohort
set.seed(42)
simDataFrame$transplantCR1 <- rbinom(nrow(simDataFrame), 1, mean(data$transplantCR1))
simDataFrame$transplantRel <- rbinom(nrow(simDataFrame), 1, mean(data$transplantRel))
simDataSurv5 <- SimSurv5(coxRFXNcdTD, coxRFXCrTD, coxRFXNrdTD, coxRFXRelTD, coxRFXPrdTD, simDataFrame, coxphOs, coxphPrs, censInd, censCr, censRel)
Estimate RFX transition rates
simCr <- Surv(ifelse(!is.na(simDataSurv5$crTime), simDataSurv5$crTime, simDataSurv5$edTime), !is.na(simDataSurv5$crTime))
simRfxCr <- CoxRFX(simDataFrame[names(crGroups)], simCr, groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
simNcd <- Surv(ifelse(!is.na(simDataSurv5$edTime), simDataSurv5$edTime, simDataSurv5$crTime), simDataSurv5$status & !is.na(simDataSurv5$edTime))
simRfxEs <- CoxRFX(simDataFrame[names(crGroups)], simNcd, groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
simRel <- Surv(ifelse(!is.na(simDataSurv5$relTime), simDataSurv5$relTime, simDataSurv5$nrdTime) - simDataSurv5$crTime, !is.na(simDataSurv5$relTime))
simRfxRel <- CoxRFX(simDataFrame[names(crGroups)], simRel, groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
simNrd <- Surv(ifelse(!is.na(simDataSurv5$relTime), simDataSurv5$relTime, simDataSurv5$nrdTime) - simDataSurv5$crTime, simDataSurv5$status & !is.na(simDataSurv5$nrdTime))
simRfxNrs <- CoxRFX(simDataFrame[names(crGroups)], simNrd, groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
simPrd <- Surv(simDataSurv5$prdTime - simDataSurv5$relTime, simDataSurv5$status)
simRfxPrs <- CoxRFX(simDataFrame[names(crGroups)], simPrd, groups=crGroups, which.mu = intersect(mainGroups, unique(crGroups)))
plot(coef(coxRFXCrTD),coef(simRfxCr))
cor(coef(coxRFXCrTD),coef(simRfxCr))
## [1] 0.906445
plot(coef(coxRFXNcdTD),coef(simRfxEs))
cor(coef(coxRFXNcdTD),coef(simRfxEs))
## [1] 0.7123169
plot(coef(coxRFXRelTD),coef(simRfxRel))
cor(coef(coxRFXRelTD),coef(simRfxRel))
## [1] 0.9252248
plot(coef(coxRFXNrdTD),coef(simRfxNrs))
cor(coef(coxRFXNrdTD),coef(simRfxNrs))
## [1] 0.8902996
plot(coef(coxRFXPrdTD),coef(simRfxPrs))
cor(coef(coxRFXPrdTD),coef(simRfxPrs))
## [1] 0.7955286
Now compute the multistage RFX model
xmax <- 2000
xx <- 0:ceiling(xmax)
simPrs <- coxph(Surv(prdTime-relTime, status)~ pspline(relTime-crTime, df=10), data=simDataSurv5)
simPrsBaseline <- exp(predict(simPrs, newdata=data.frame(relTime=xx[-1], crTime=0))) ## Hazard (function of CR length)
simOs <- coxph(Surv(pmax(nrdTime, prdTime, na.rm=TRUE)-crTime, status)~ pspline(crTime, df=5), data=simDataSurv5)
simOsBaseline <- exp(predict(simOs, newdata=data.frame(crTime=xx[-1]))) ## Hazard (function of CR length)
simMultiRfx5 <- MultiRFX5(simRfxEs, simRfxCr, simRfxNrs, simRfxRel, simRfxPrs, data, tdPrmBaseline = simPrsBaseline, tdOsBaseline = simOsBaseline, x=xmax)
plot(colSums(fiveStagePredicted[3*365,1:3,]), colSums(simMultiRfx5[3*365,1:3,]))
abline(0,1)
Also compute the predicted benefit of allografts with confidence intervals
d <- osData[1:nrow(dataFrame),]
d$transplantCR1 <- 0
d$transplantRel <- 0
simMultiRFX3TplCi <- MultiRFX3TplCi(simRfxNrs, simRfxRel, simRfxPrs, data=d[,colnames(coxRFXNrdTD$Z)], x=3*365, nSim=200, prdData=prdData) ## others with 200
plot(multiRFX3TplCi["dCr1Rel","hat","os",] , simMultiRFX3TplCi["dCr1Rel","hat","os",], xlab="Benefit 1,540 patients", ylab="Benefit 10,000 patients")
plot(multiRFX3TplCi["dCr1Rel","upper","os",] - multiRFX3TplCi["dCr1Rel","lower","os",], simMultiRFX3TplCi["dCr1Rel","upper","os",]-simMultiRFX3TplCi["dCr1Rel","lower","os",], xlab="CI width 1,540 patients", ylab="CI width 10,000 patients")
abline(0,0.5)
Here we reassess the effect of HSCTs, also considering the magnitude of prediction errors in the current data set as well as extrapolated errors.
Benefit vs. number of allografts in CR1
par(bty="L")
s <- clinicalData$AOD < 60 & !is.na(clinicalData$CR_date) & !clinicalData$TPL_Phase %in% c("PR1","RD1")
fAlloRelapse <- sum(prdData$transplantRel & s[!is.na(clinicalData$Recurrence_date)][prdData$index])/sum(relData$status & !relData$transplantCR1 & s[relData$index]) # fraction of patients that have received a salvage transplant
benefitAllo <- multiRFX3TplLoo[,"cr1"] - (fAlloRelapse*multiRFX3TplLoo[,"rel"] +(1-fAlloRelapse)*multiRFX3TplLoo[,"none"])
o <- order(-benefitAllo + ifelse(!s,NA,0), na.last=NA)
pRelapse <- 1+multiRFX3TplCiLoo[1:2,1,"aar",] - multiRFX3TplCiLoo[1:2,1,"rs",] ## Relapse probabilities
fRelapse <- sapply(seq_along(o), function(i) mean(c(pRelapse[2,o[1:i]], pRelapse[1,o[-(1:i)]]), na.rm=TRUE)) # Personalised
sIdeal <- sapply(seq_along(o), function(i) mean(c(multiRFX3TplLoo[o[1:i],"cr1"], (1-fAlloRelapse)*multiRFX3TplLoo[o[-(1:i)],"none"] + fAlloRelapse*multiRFX3TplLoo[o[-(1:i)],"rel"]), na.rm=TRUE))
x <- seq_along(sIdeal)/length(sIdeal)
plot(x + (1-x)*fRelapse*fAlloRelapse,sIdeal, type='l', xlab="Total fraction of allografts", ylab="Survival of eligible patients 3yrs after CR", col=set1[1], xaxs="i", yaxs="i", lty=3)
ci <- multiRFX3TplCiLoo["dCr1Rel","upper","os",]-multiRFX3TplCiLoo["dCr1Rel","lower","os",] # 1540 patients
sCi1540 <- rowMeans(sapply(1:10, function(foo){ set.seed(foo)
o <- order(-benefitAllo + ifelse(!s,NA,0) + rnorm(1540,sd=ci/4), na.last=NA)
s <- sapply(seq_along(o), function(i) mean(c(multiRFX3TplLoo[o[1:i],"cr1"], (1-fAlloRelapse)*multiRFX3TplLoo[o[-(1:i)],"none"] + fAlloRelapse*multiRFX3TplLoo[o[-(1:i)],"rel"]), na.rm=TRUE))
}))
lines(x + (1-x)*fRelapse*fAlloRelapse, sCi1540, type='l',col=set1[1], lty=1)
simCi <- simMultiRFX3TplCi["dCr1Rel","upper","os",]-simMultiRFX3TplCi["dCr1Rel","lower","os",]
sCi10000 <- rowMeans(sapply(1:10, function(foo){ set.seed(foo)
o <- order(-benefitAllo + ifelse(!s,NA,0) + rnorm(1540,sd=simCi/4), na.last=NA)
s <- sapply(seq_along(o), function(i) mean(c(multiRFX3TplLoo[o[1:i],"cr1"], (1-fAlloRelapse)*multiRFX3TplLoo[o[-(1:i)],"none"] + fAlloRelapse*multiRFX3TplLoo[o[-(1:i)],"rel"]), na.rm=TRUE))
}))
lines(x + (1-x)*fRelapse*fAlloRelapse, sCi10000, type='l',col=set1[1], lty=2)
p <- order(na.zero(c(1,4,2,3)[clinicalData$M_Risk]) + dataFrame$AOD_10/20 + ifelse(!s,NA,0), na.last=NA)
fRelapseEln <- sapply(seq_along(p), function(i) mean(c(pRelapse[2,p[1:i]], pRelapse[1,p[-(1:i)]]), na.rm=TRUE)) # ELN
sEln <- sapply(seq_along(p), function(i) mean(c(multiRFX3TplLoo[p[1:i],"cr1"], (1-fAlloRelapse)*multiRFX3TplLoo[p[-(1:i)],"none"] + fAlloRelapse*multiRFX3TplLoo[p[-(1:i)],"rel"]), na.rm=TRUE))
x <- seq_along(sEln)/length(sEln)
lines(x + (1-x)*fRelapseEln*fAlloRelapse, sEln, type='l', col=set1[2])
legend("bottomright", c("Personalised risk", "Idealised","10,000 patients","This cohort", "Standard risk","ELN and age"), col=set1[c(NA,1,1,1,NA,2)],lty=c(NA,3,2,1,NA,1), bty="n", text.font=c(2,1,1,1,2,1))
Total numbers of transplants
fAlloCR1 <- 0.3 ## Assume 30% allografts in CR1
i <- which(x > fAlloCR1)[1] - 1
c(`Knowledge bank`=(x + (1-x)*fRelapse*fAlloRelapse)[i], ELN=(x + (1-x)*fRelapseEln*fAlloRelapse)[i])
## Knowledge bank ELN
## 0.4358391 0.4380818
Projected survival at 3yrs
c(ELN=sEln[i], `This cohort`=sCi1540[i], `10000 patients` = sCi10000[i], Optimal=s[i])
## ELN This cohort 10000 patients Optimal
## 0.6117393 0.6201484 0.6234637 1.0000000
The same survival as with ELN-based allocation is achieved with the following fraction of allografts
j <- c(`This cohort`=which(sCi1540 >= sEln[i])[1]-1, `10000 patients`=which(sCi10000 >= sEln[i])[1]-1, Optimal=which(sIdeal >= sEln[i])[1]-1)
fAlloCR1Pers <- (x + (1-x)*fRelapse*fAlloRelapse)[j]
names(fAlloCR1Pers) <- names(j)
fAlloCR1Pers
## This cohort 10000 patients Optimal
## 0.3634495 0.3463946 0.3392658
We have implemented the aforementioned multistage prediction model as a shiny web server.
The following data is saved for running the webtool:
save(coxRFXRelTD, coxRFXNrdTD, coxRFXPrdTD, coxRFXOsCR, coxRFXNcdTD, coxRFXCrTD, cr, nrdData, relData, prdData, osData, crGroups, data, clinicalData, file="../code/multistage/multistage.RData")
The server runs according to the following script
read_chunk('../code/multistage/server.R', labels="server.R")
library(shiny)
library(RColorBrewer)
library(CoxHD)
library(Rcpp)
load("multistage.RData", envir=globalenv())
cr <<- cr
set1 <- brewer.pal(8, "Set1")
pastel1 <- brewer.pal(8, "Pastel1")
s <- !crGroups %in% c("Nuisance","GeneGene") & ! names(crGroups) %in% c("ATRA","VPA")
VARIABLES <- names(crGroups)[s]
rg <- c("Fusions"=5, "CNA"=4,"Genetics"=3, "Clinical"=7, "Demographics"=8, "Treatment"=6)
o <- order(rg[crGroups[s]],(coef(coxRFXPrdTD)^2/diag(coxRFXPrdTD$var2))[VARIABLES], decreasing=TRUE)
VARIABLES <- VARIABLES[o]
NEWGRP <- c(0,diff(as.numeric(as.factor(crGroups))[s][o])) != 0
names(NEWGRP) <- VARIABLES
INTERACTIONS <- names(crGroups)[crGroups %in% "GeneGene"]
NUISANCE <- names(crGroups)[crGroups %in% "Nuisance" | names(crGroups) %in% c("ATRA","VPA")]
SCALEFACTORS<- rep(1, length(VARIABLES))
names(SCALEFACTORS) <- VARIABLES
w <- crGroups[VARIABLES] %in% c("Demographics","Clinical")
r <- regexpr("(?<=_)[0-9]+$", VARIABLES[w], perl=TRUE)
SCALEFACTORS[w][r!=-1] <- as.numeric(regmatches(VARIABLES[w],r))
CATEGORIES <- sapply(VARIABLES, function(x){
if(length(unique(data[,x])) <= 10){
c <- min(data[,x]):max(data[,x])
if(all(c %in% 0:1))
names(c) <- c("absent","present")
else if(x =="gender")
names(c) <- c("male","female")
return(c)
}
else
NULL
})
LABELS <- sapply(VARIABLES, function(x){
r <- round(range(data[,x]*SCALEFACTORS[x], na.rm=TRUE),1)
i <- paste0(" [",r[1],"-",r[2],"]")
paste0(sub(paste0("_",SCALEFACTORS[x],"$"),"",x), ifelse(is.null(CATEGORIES[[x]]),i,""))
})
LABELS["AOD_10"] <- sub("AOD", "Age at diagnosis (yr)", LABELS["AOD_10"])
LABELS["LDH_1000"] <- sub("LDH", "Lactic Acid Dehydrogenase (units/l)", LABELS["LDH_1000"])
LABELS["wbc_100"] <- sub("wbc", "White cell count (1e-9/l)", LABELS["wbc_100"])
LABELS["HB_10"] <- sub("HB", "Hemoglobin (g/l)", LABELS["HB_10"])
LABELS["BM_Blasts_100"] <- sub("BM_Blasts", "Bone marrow blasts (%)", LABELS["BM_Blasts_100"])
LABELS["PB_Blasts_100"] <- sub("PB_Blasts", "Peripheral blood blasts (%)", LABELS["PB_Blasts_100"])
LABELS["platelet_100"] <- sub("platelet", "Platelet count (1e-9/l)", LABELS["platelet_100"])
LABELS["VPA"] <- "VPA (Valproic acid)"
LABELS["transplantCR1"] <- "Allograft in CR1"
LABELS["transplantRel"] <- "Allograft after Relapse"
LABELS["gender"] <- "Gender"
LABELS <- sub("t_*([a-z,0-9]+)_([a-z,0-9]+)", "t(\\1;\\2)", LABELS)
LABELS[crGroups[VARIABLES] %in% c("Fusions","CNA")] <- gsub("_","/",LABELS[crGroups[VARIABLES] %in% c("Fusions","CNA")])
LABELS <- sub("plus","+",LABELS)
LABELS <- sub("minus|^mono","-",LABELS)
LABELS <- sub("(_|/)*other"," (other)", LABELS)
LABELS <- sub("_([0-9a-zA-Z]+)"," (\\1)", LABELS)
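## The cumulative effect of the substitution chain above can be checked on a few
## representative variable names. The condensed helper below (hypothetical, for
## illustration only) applies the gsub("_","/") step unconditionally, which in
## the server code is restricted to Fusions/CNA labels; the examples used here
## all fall into those groups.
prettify <- function(x){
	x <- sub("t_*([a-z,0-9]+)_([a-z,0-9]+)", "t(\\1;\\2)", x)
	x <- gsub("_", "/", x)
	x <- sub("plus", "+", x)
	x <- sub("minus|^mono", "-", x)
	x <- sub("(_|/)*other", " (other)", x)
	x
}
prettify(c("t_9_11", "minus7q", "plus8_8q", "abn7other"))
## expected: "t(9;11)" "-7q" "+8/8q" "abn7 (other)"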
COMPVAR <- list(`Allogeneic HSCT`=c(none="none", `in first CR`="transplantCR1", `after relapse`="transplantRel"), `AML type`=c(primary='AML', secondary='sAML',tertiary='tAML',other='oAML')) ## Compound variables (factors)
COMPIDX <- numeric(length(VARIABLES))
names(COMPIDX) <- VARIABLES
COMPIDX[c("transplantRel","oAML")] <- 1 ## Index of last elements for display
VAR2COMP <- unlist(sapply(names(COMPVAR), function(n) rep(n, length(COMPVAR[[n]]))))
names(VAR2COMP) <- unlist(COMPVAR)
#* AOD: Age at diagnosis (yr)
#* LDH: Lactic Acid Dehydrogenase (units/l)
#* WBC: White cell count (1e-9/l)
#* HB: Hemoglobin (g/l)
#* BM_Blasts: Bone marrow blasts (%)
#* PB_Blasts: Peripheral blood blasts (%)
cppFunction('NumericVector computeHierarchicalSurvival(NumericVector x, NumericVector diffS0, NumericVector S1Static, NumericVector haz1TimeDep) {
    int xLen = x.size();
    double h;
    NumericVector overallSurvival(xLen);
    for (int i = 0; i < xLen; ++i)
        overallSurvival[i] = 1;
    for (int j = 1; j < xLen; ++j) {
        h = haz1TimeDep[j-1];
        for (int i = j; i < xLen; ++i) {
            overallSurvival[i] += diffS0[j-1] * (1 - pow(S1Static[i-j], h));
        }
    }
    return overallSurvival;
}')
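To see what this kernel computes — a convolution of the first-stage incidence increments with a second-stage survival curve whose hazard is rescaled by a time-dependent factor — here is a plain-Python transliteration (a sketch for sanity checking only; the function and argument names simply mirror the C++ above):

```python
def compute_hierarchical_survival(x, diff_s0, s1_static, haz1_time_dep):
    """Convolve the increments of the first-stage incidence (diff_s0,
    negative for a decreasing survival curve) with the second-stage
    survival curve s1_static, raised to a time-dependent proportional
    hazard haz1_time_dep for the time of the first-stage transition."""
    n = len(x)
    overall = [1.0] * n
    for j in range(1, n):
        h = haz1_time_dep[j - 1]
        for i in range(j, n):
            overall[i] += diff_s0[j - 1] * (1 - s1_static[i - j] ** h)
    return overall
```

Two convenient checks: if the second-stage survival is identically one, the overall curve stays at one; if it is identically zero, the overall curve reduces to the first-stage survival.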
addGrid <- function() {
    abline(h=seq(0, 1, .2), lty=3)
    abline(v=seq(0, 2000, 365), lty=3)
}
# Define the server logic for the multistage survival predictions
shinyServer(function(input, output) {
getData <- reactive({
input$compute
isolate({
l <- list()
for(n in VARIABLES){
if(!n %in% unlist(COMPVAR)){
l[[n]] <- ifelse(input[[n]]=="NA",NA,as.numeric(input[[n]]))
if(is.null(input[[n]])) l[[n]] <- NA
}else{
l[[n]] <- ifelse(input[[VAR2COMP[n]]]=="NA", NA, input[[VAR2COMP[n]]]==n) + 0
if(is.null(input[[VAR2COMP[n]]])) l[[n]] <- NA
}
}
for(n in INTERACTIONS){
s <- strsplit(n, ":")[[1]]
l[[n]] <- l[[s[1]]] * l[[s[2]]]
}
for(n in NUISANCE)
l[[n]] <- NA
out <- do.call("data.frame",l)
names(out) <- names(l)
out[VARIABLES] <- out[VARIABLES]/SCALEFACTORS
return(out)
})
})
output$ui <- renderUI({
pdid <- input[["pdid"]]
if(is.null(pdid)) pdid <- "reset"
if( pdid=="reset"){
#cat("reset\n")
defaults <- data[1,]
defaults[] <- NA
}else
defaults <- data[pdid,]
defaults <- as.numeric(defaults)
## Obfuscation
defaults <- round(defaults * 5)/5
names(defaults) <- colnames(data)
defaults[VARIABLES] <- defaults[VARIABLES] * SCALEFACTORS
#cat(defaults,"\n")
c(list(tags$em(tags$b(crGroups[VARIABLES[1]]))),
lapply(VARIABLES,
function(x) {
d <- defaults[x]
f <- if(x %in% unlist(COMPVAR)){
if(!COMPIDX[x]) return(NULL)
s <- defaults[COMPVAR[[VAR2COMP[x]]][-1]]
w <- if(any(is.na(s))) 'N/A' else if(all(s==0)) 1 else if(any(!s %in% c(0,1))) 'N/A' else which(s==1)+1
c <- c(COMPVAR[[VAR2COMP[x]]], "N/A"="N/A")
radioButtons(VAR2COMP[x], label=VAR2COMP[x], choices=c, selected=c[w])
}else if(crGroups[x] %in% c("Genetics","CNA","Fusions","Treatment")){
if(!d %in% c(0,1)) d <- NA
d <- paste(d)
radioButtons(x, label=if(crGroups[x]=="Genetics") tags$em(LABELS[x]) else LABELS[x], choices=c("present"= "1", "absent"="0", "N/A"="NA"), selected=d)
}else{
r <- round(quantile(data[,x]*SCALEFACTORS[x], c(0.05,0.95), na.rm=TRUE),1)
if(is.null(CATEGORIES[[x]]))
numericInput(inputId=x, label=LABELS[x], value=d, min=r[1], max=r[2], step = if(round(min(data[,x]*SCALEFACTORS[x], na.rm=TRUE),1) %% 1 ==0) 1 else 0.1)
else{
if(!d %in% 0:10) d <- NA
d <- paste(d)
radioButtons(x, label=LABELS[x], choices=c(CATEGORIES[[x]],"N/A"="NA"), selected=d)
}
}
h <- if(NEWGRP[x]) list(tags$hr(), tags$em(tags$b(crGroups[x]))) else NULL
list(h,f)}
))
})
x <- 0:2000
computeIncidence <- function(coxRFX, r, x) {
    ## r = PredictRiskMissing(coxRFX, data, var="var2")
    if (!is.null(coxRFX$na.action))
        coxRFX$Z <- coxRFX$Z[-coxRFX$na.action,]
    H0 <- basehaz(coxRFX, centered=FALSE)
    hazardDist <- splinefun(H0$time, H0$hazard, method="monoH.FC")
    lambda0 <- hazardDist(x)
    r0 <- coxRFX$means %*% coef(coxRFX)
    inc <- exp(-lambda0 * exp(r[,1]))
    ciup2 <- exp(-lambda0 * exp(rep(r[,1] + 2*sqrt(r[,2]), each=length(x))))
    cilo2 <- exp(-lambda0 * exp(rep(r[,1] - 2*sqrt(r[,2]), each=length(x))))
    ciup <- exp(-lambda0 * exp(rep(r[,1] + sqrt(r[,2]), each=length(x))))
    cilo <- exp(-lambda0 * exp(rep(r[,1] - sqrt(r[,2]), each=length(x))))
    return(list(inc=inc, r=r, x=x, hazardDist=hazardDist, r0=r0, ciup=ciup, cilo=cilo, ciup2=ciup2, cilo2=cilo2))
}
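computeIncidence rests on the proportional-hazards identity \(S(t) = \exp(-\Lambda_0(t)\,e^{r})\), with confidence bands obtained by shifting the risk score \(r\) by \(\pm 2\) posterior standard deviations. A small Python sketch with made-up numbers (the helper name is ours):

```python
import math

def incidence_with_ci(cum_haz, r, var_r):
    """Survival under a proportional-hazards model: S(t) = exp(-L0(t) * e^r).
    The interval shifts the linear predictor r by +/- 2 * sd(r); a higher
    risk score gives a lower survival curve, and vice versa."""
    sd = math.sqrt(var_r)
    surv = [math.exp(-l * math.exp(r)) for l in cum_haz]
    lo = [math.exp(-l * math.exp(r + 2 * sd)) for l in cum_haz]   # lower survival bound
    hi = [math.exp(-l * math.exp(r - 2 * sd)) for l in cum_haz]   # upper survival bound
    return surv, lo, hi
```

With zero risk variance the band collapses onto the point estimate, mirroring the ciup/cilo construction above.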
plotRisk <- function(coxRFX, incidence, p, xlab="Days after diagnosis", ylab="Incidence", col="#FF0000",mark=NA, lty=2) {
plot(survfit(coxRFX), xlab=xlab, ylab=ylab, mark=mark, conf.int=FALSE, fun=function(x) 1-x, ylim=c(0,1), xlim=c(0,2000), lty=3)
#lines(survfit(coxRFX$surv ~ 1), lty=3, mark=NA, fun=function(x) 1-x)
polygon( c(incidence$x, rev(incidence$x)), 1-c(incidence$ciup2, rev(incidence$cilo2)), col=paste0(col,"44"), border=NA)
addGrid()
#polygon( c(risk$x, rev(risk$x)), 1-c(ciup, rev(cilo)), col=paste0(col,"44"), border=NA)
lines( incidence$x, 1-incidence$inc, col=col, lwd=2, lty=lty)
legend(ifelse((1-incidence$inc[length(incidence$inc)])>.5, "bottomright","topright"), c("Population avg","Predicted","95% CI"),bty="n", lty=c(2,1,NA), fill=c(NA,NA,paste0(col,"44")), border=c(NA,NA,NA), col=c(1,col,NA))
u <- par("usr")
par(new=T)
m <- colMeans(PartialRisk(coxRFX))
rds <- .05
p <- (matrix(p, nrow=1) - m)/10 + rds
colnames(p) <- names(m)
l <- cbind(x=0.25,y=ifelse((1-incidence$inc[length(incidence$inc)])>.8,0.3,.85))
c <- cut(incidence$r[,1]-incidence$r0, quantile(coxRFX$linear.predictor,seq(0,1,l=12)))
stars((p[,c("Demographics","Treatment","Fusions","CNA","Genetics","GeneGene","Clinical"), drop=FALSE]), scale=FALSE, locations = l, xlim=c(0,1), ylim=c(0,1), lwd=1, col.stars=rev(brewer.pal(11,"RdBu"))[c])
symbols(l, circles=rds, inches=FALSE, add=TRUE)
text(l[1]+cos(2*pi*0:6/7)*2*rds,l[2]+sin(2*pi*0:6/7)*2*rds,substr(c("Demographics","Treatment","Fusions","CNA","Genetics","GeneGene","Clinical"),1,5), cex=.66)
par(usr=u)
}
dataImputed <- reactive({
ImputeMissing(data[1:1540,], getData()[,colnames(data)])
})
models <- c("Ncd","Cr","Rel","Nrd","Prd")
riskMissing <- reactive({
sapply(models, function(m){
fit <- get(paste("coxRFX",m,"TD", sep=""))
if(!is.null(fit$na.action))
fit$Z <- fit$Z[-fit$na.action,]
PredictRiskMissing(fit, getData(), var="var2")}, simplify = FALSE)})
partialRiskMissing <- reactive({sapply(models, function(m){
fit <- get(paste("coxRFX",m,"TD", sep=""))
if(!is.null(fit$na.action))
fit$Z <- fit$Z[-fit$na.action,]
PartialRisk(fit, dataImputed())}, simplify = FALSE)})
output$Tab <- renderDataTable({
x <- dataImputed()
data.frame(Covariate=colnames(x),signif(data.frame(Input=as.numeric(getData()[,colnames(data)]), Imputed=as.numeric(x),
`Coef NCD`=coef(coxRFXNcdTD), `Value NCD`= as.numeric(x)*coef(coxRFXNcdTD),
`Coef CR`=coef(coxRFXCrTD), `Value CR`= as.numeric(x)*coef(coxRFXCrTD),
`Coef NRD`=coef(coxRFXNrdTD), `Value NRD`= as.numeric(x)*coef(coxRFXNrdTD),
`Coef Rel`=coef(coxRFXRelTD), `Value Rel`= as.numeric(x)*coef(coxRFXRelTD),
`Coef PRD`=coef(coxRFXPrdTD), `Value PRD`= as.numeric(x)*coef(coxRFXPrdTD)),2))
})
output$Risk <- renderDataTable({
t <- sapply(c("Ncd","Cr","Nrd","Rel","Prd"), function(m){
r <- riskMissing()[[m]]
x <- get(paste("coxRFX",m,"TD", sep=""))
Z <- if(!is.null(x$na.action)) x$Z[-x$na.action,] else x$Z
p <- PartialRisk(x, newZ= rbind(dataImputed(),colMeans(data[1:1540,])))
p <- p[1,]-p[2,]
#p <- p[-length(p)]
c(round(p,3), `total`=round(r[1,1] - mean(Z %*% coef(x)),3),
`sd`=round(sqrt(r[1,2]),3))
})
colnames(t) <- c("Death without CR (NCD)", "Complete remission (CR)", "Death without relapse (NRD)", "Relapse","Death after relapse (PRD)")
data.frame(Value=c(levels(coxRFXRelTD$groups),"total","s.d"), t, check.names=FALSE)
})
## Convolution approach to PRM
survPredict <- function(surv) {
    s <- survfit(surv ~ 1)
    splinefun(s$time, s$surv, method="monoH.FC")
}
prsP <- survPredict(Surv(prdData$time1, prdData$time2, prdData$status))(x) # Baseline Prs (measured from relapse)
coxphPrs <- coxph(Surv(time1, time2, status)~ pspline(time0, df=10), data=data.frame(prdData, time0=as.numeric(clinicalData$Recurrence_date-clinicalData$CR_date)[prdData$index]))
tdPrmBaseline <- exp(predict(coxphPrs, newdata=data.frame(time0=x[-1]))) ## Hazard (function of CR length)
coxphOs <- coxph(Surv(time1,time2, status)~ pspline(time0, df=10), data=data.frame(osData, time0=pmin(500,cr[osData$index,1])))
tdOsBaseline <- exp(predict(coxphOs, newdata=data.frame(time0=x[-1]))) ## Hazard (function of induction length), only for OS (could do CIR, NRM, PRS separately)
# Competing-risk adjustments to obtain absolute probabilities
crAdjust <- function(x, y, time=x$x) {
    .crAdjust(x$inc, y$inc, time)
}
.crAdjust <- function(inc1, inc2, time) {
    cumsum(c(1, diff(inc1) * splinefun(time, inc2)(time[-1])))
}
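.crAdjust accumulates the decrements of one Kaplan–Meier curve, each weighted by the competing curve evaluated at the same time point. A minimal Python restatement (a simplification that assumes both curves are already sampled on a common time grid, so no spline interpolation is needed):

```python
def cr_adjust(inc1, inc2):
    """Competing-risk adjustment: accumulate the decrements of inc1,
    each weighted by the probability inc2 of still being free of the
    competing event at that time. Both curves start at 1.0 and are
    sampled on the same grid."""
    out = [1.0]
    for k in range(1, len(inc1)):
        out.append(out[-1] + (inc1[k] - inc1[k - 1]) * inc2[k])
    return out
```

When the competing curve is identically one (no competing events), the adjustment leaves the first curve unchanged.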
output$KM <- renderPlot({
par(bty="n", mar=c(3,3,2,1), mgp=c(2,0.5,0), tcl=-.25, xaxs="i", yaxs="i")
layout(matrix(1:3, ncol=3), widths=c(1,1,0.5))
par(cex=1)
## KM incidence of NCD and CR
kmNcd <- computeIncidence(coxRFX = coxRFXNcdTD, r = riskMissing()[["Ncd"]], x=x)
kmCr <- computeIncidence(coxRFX = coxRFXCrTD, r = riskMissing()[["Cr"]], x=x)
## Correct KM estimates for competing risks
ncd <- crAdjust(x=kmNcd, time=x, y=kmCr)
cr <- crAdjust(x=kmCr, time=x, y=kmNcd)
## KM incidence of Relapse and NRD
kmRel <- computeIncidence(coxRFX = coxRFXRelTD, r = riskMissing()[["Rel"]], x=x)
kmNrd <- computeIncidence(coxRFX = coxRFXNrdTD, r = riskMissing()[["Nrd"]], x=x)
## Correct KM estimates for competing risks
relCr <- crAdjust(x=kmRel, time=x, y=kmNrd)
nrsCr <- crAdjust(x=kmNrd, time=x, y=kmRel)
## KM incidence of PRS
kmPrs <- computeIncidence(coxRFX = coxRFXPrdTD, r = riskMissing()[["Prd"]], x=x)
## Outcome after Remission
rsCr <- computeHierarchicalSurvival(x = x, diffS0 = diff(relCr), S1Static = prsP, haz1TimeDep = tdPrmBaseline * exp(kmPrs$r[,1]-kmPrs$r0))
osCr <- 1-(1-nrsCr)-(1-rsCr)
## Outcome from diagnosis
osDiag <- computeHierarchicalSurvival(x = x, diffS0 = diff(cr), S1Static = osCr, haz1TimeDep = tdOsBaseline)
nrsDiag <- computeHierarchicalSurvival(x = x, diffS0 = diff(cr), S1Static = nrsCr, haz1TimeDep = tdOsBaseline)
rsDiag <- computeHierarchicalSurvival(x = x, diffS0 = diff(cr), S1Static = rsCr, haz1TimeDep = tdOsBaseline)
relDiag <- computeHierarchicalSurvival(x = x, diffS0 = diff(cr), S1Static = relCr, haz1TimeDep = tdOsBaseline)
xLen <- length(x)
## Plot probabilities
plot(x, 1-(1-ncd)-(1-osDiag), type="l", xlab="Days from diagnosis", ylab="Probability", main="Outcome after diagnosis", ylim=c(0,1), lwd=3, lty=0)
y0 <- 1
y <- ncd
polygon(c(x, x[xLen]), c(y,y0), border=NA, col=pastel1[1])
y0 <- y0 - (1-ncd)
y <- y - (1-nrsDiag)
polygon(c(x, rev(x)), c(y, rev(y0)) , border=NA, col=pastel1[2])
y0 <- y0 - (1-nrsDiag)
y <- y - (1-rsDiag)
osDiag <- y
polygon(c(x, rev(x)), c(y, rev(y0)), border=NA, col=pastel1[3])
y0 <- y0 - (1-rsDiag)
y <- y - (1-relDiag) + (1-rsDiag)
polygon(c(x, rev(x)), c(y, rev(y0)), border=NA, col=pastel1[5])
polygon(c(x, rev(x)), c(cr - (1-ncd), rev(y)), border=NA, col=pastel1[4])
polygon(c(x, rev(x)), c(cr - (1-ncd), rev(rep(0, length(x)))), border=NA, col="#DDDDDD")
lines(x, osDiag, lwd=3)
z <- c(365,3*365)
y <- (osDiag)[z+1]
points(z,y, pch=16, col=1)
text(z, y, labels=round(y,2), pos=1)
addGrid()
## Confidence intervals
osLoDiag <- osUpDiag <- rep(NA, length(osDiag))
if("analytical" %in% input$ciType){
PlogP2 <- function(x) {(x * log(x))^2}
errOsCr <- kmNrd$r[,2] * PlogP2(kmNrd$inc) * (1-(1-kmRel$inc) * (1-kmPrs$inc))^2 + kmRel$r[,2] * PlogP2(kmRel$inc) * (1-kmPrs$inc)^2* kmNrd$inc^2 + kmPrs$r[,2] * PlogP2(kmPrs$inc) * (1-kmRel$inc)^2* kmNrd$inc^2
errOsCr <- sqrt(errOsCr / PlogP2(osCr))
osUpCr <- osCr ^ exp(2* errOsCr)
osLoCr <- osCr ^ exp(-2*errOsCr)
#segments(z, osLo[z+1] ,z,osUp[z+1], col=1, lwd=2)
}
if("simulated" %in% input$ciType){
## Simulate CI
nSim <- 200
osCrMc <- sapply(1:nSim, function(i){
r <- exp(rnorm(5,0,sqrt(c(kmRel$r[,2],kmNrd$r[,2],kmPrs$r[,2], kmNcd$r[,2], kmCr$r[,2]))))
nrsCr <- .crAdjust(kmNrd$inc^r[2], kmRel$inc^r[1], time=x) ## Correct KM estimate for competing risk
diffCir <- diff(kmRel$inc^r[1]) * kmNrd$inc[-1]^r[2] ## Correct KM estimate for competing risk
rsCr <- computeHierarchicalSurvival(x = x, diffS0 = diffCir, S1Static = prsP, haz1TimeDep = tdPrmBaseline * exp(kmPrs$r[,1]-kmPrs$r0+log(r[3])))
osCr <- 1-(1-nrsCr)-(1-rsCr)
cr <- .crAdjust(kmCr$inc^r[5], kmNcd$inc^r[4], time=x)
ncd <- .crAdjust(kmNcd$inc^r[4], kmCr$inc^r[5], time=x)
osDiag <- computeHierarchicalSurvival(x = x, diffS0 = diff(cr), S1Static = osCr, haz1TimeDep = tdOsBaseline)
return(cbind(osCr, osDiag - (1-ncd)))
}, simplify="array")
osCrMcQ <- apply(osCrMc,1:2,quantile, c(0.025,0.975))
osLoCr <- osCrMcQ[1,,1]
osUpCr <- osCrMcQ[2,,1]
osLoDiag <- osCrMcQ[1,,2]
osUpDiag <- osCrMcQ[2,,2]
}
## CI
lines(x, osUpDiag, col=1, lty=2)
lines(x, osLoDiag, col=1, lty=2)
## Plot outcome after remission
plot(NA,NA, xlab="Days from remission", ylab="Probability", xlim=c(0,2000), ylim=c(0,1), lty=2)
polygon(c(x, x[xLen]), c(nrsCr,1) , border=NA, col=pastel1[2])
polygon(c(x, rev(x)), c(nrsCr, rev(osCr)), border=NA, col=pastel1[3])
polygon(c(x, rev(x)), c(osCr, rev(1-(1-nrsCr)-(1-relCr))), border=NA, col=pastel1[5])
polygon(c(x, rev(x)), c(1-(1-nrsCr)-(1-relCr), rep(0,length(x))), border=NA, col=pastel1[4])
abline(h=seq(0,1,.2), lty=3)
abline(v=seq(0,2000,365), lty=3)
lines(x, osCr, col=1, lwd=3)
title("Outcome after remission")
y <- (osCr)[z+1]
points(z,y, pch=16, col=1)
text(z, y, labels=round(y,2), pos=1)
## CI
lines(x, osUpCr, col=1, lty=2)
lines(x, osLoCr, col=1, lty=2)
par(mar=c(0,0,0,0))
plot(NA,NA, xlab="",ylab="", xaxt="n", yaxt="n", xlim=c(0,1), ylim=c(0,1))
legend(x=0,y=1, col=c(NA,NA,NA,NA,NA,NA,"black","black"), lty=c(NA,NA,NA,NA,NA,NA,1,4), fill=c(pastel1[c(1,2,3,5,4)],"#DDDDDD",NA,NA), border=c(1,1,1,1,1,1,NA,NA), lwd=c(NA,NA,NA,NA,NA,NA,3,1), y.intersp = 1.5, c("Death without \nremission","Death without \nrelapse","Death after \nrelapse","Alive after \nrelapse","Alive in CR1", "Alive in \ninduction", "Overall survival", "95% C.I."), box.lwd = 0, bg="#FFFFFF88", seg.len=1)
})
})
The Shiny user interface is defined by the following script:
read_chunk('../code/multistage/ui.R', labels="ui.R")
library(shiny)
library(CoxHD)
load("multistage.RData", envir=globalenv())
# Define the UI for the multistage prediction application
fluidPage(
# Style
tags$header(tags$style(
type = 'text/css',
'.well .special { max-height: 400px; overflow-y: auto; }'
)
),
# Application title
titlePanel("AML multistage predictions (beta)"),
fluidRow(
# Sidebar with sample selection and prognostic variable inputs
column(3,
wellPanel(
selectInput("pdid", tags$b("Select sample"), c("reset",rownames(data)), selected = "reset", multiple=FALSE)
),
wellPanel(
tags$b("Prognostic variables"),
tags$hr(),
HTML('<div class="special">'),
uiOutput("ui"),
HTML("</div>")
),
wellPanel(actionButton("compute", "Compute survival"),
tags$hr(),
radioButtons("ciType", tags$b("Confidence intervals"), choices=c("analytical (fast, CR only)"="analytical","simulated (slow)"="simulated"), selected = "analytical"), ## CI type
tags$hr(),
div(HTML('<b><a href="help.html">Help</a></b>')))
),
# Show the predicted outcome curves and risk tables
column(8,
plotOutput(outputId="KM",height="300px"),
tabsetPanel(
tabPanel('Risk',
dataTableOutput("Risk")),
tabPanel("Coefficients",
dataTableOutput("Tab"))
))
)
)
This document was written entirely in R with markdown annotation. It was compiled with knitr::spin() (Xie 2015) and pandoc using the rmarkdown package (Allaire et al. 2015):
rmarkdown::render("SupplementaryMethodsCode.R")
The total runtime is approximately 24 hours using 10 cores. This excludes the extrapolations, which were run on a computing grid.
The packages and specifics of the R session are:
library(devtools)
devtools::session_info()
## Session info -----------------------------------------------------------------------------------------------------------
## setting value
## version R version 3.1.2 (2014-10-31)
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_GB.UTF-8
## tz <NA>
## Packages ---------------------------------------------------------------------------------------------------------------
## package * version date source
## abind 1.4-0 2011-11-27 CRAN (R 3.1.2)
## BiocGenerics * 0.12.1 2015-04-08 Bioconductor
## car * 2.0-25 2015-03-03 CRAN (R 3.1.2)
## class * 7.3-11 2014-07-21 CRAN (R 3.1.2)
## CoxHD 0.0.56 2015-10-03 Github (mg14/CoxHD@e2e5622)
## devtools 1.7.0 2015-01-17 CRAN (R 3.1.2)
## digest * 0.6.8 2014-12-31 CRAN (R 3.1.2)
## DT 0.0.21 2015-03-19 Github (rstudio/DT@ee9f602)
## evaluate * 0.5.5 2014-04-29 CRAN (R 3.1.2)
## formatR * 1.0 2014-08-25 CRAN (R 3.1.2)
## glmnet 1.9-8 2014-05-24 CRAN (R 3.1.2)
## graph 1.44.1 2015-04-09 Bioconductor
## HilbertVis 1.24.0 2015-03-03 Bioconductor
## htmltools * 0.2.6 2014-09-08 CRAN (R 3.1.2)
## htmlwidgets 0.3.2 2014-12-09 CRAN (R 3.1.2)
## KernSmooth * 2.23-13 2014-09-14 CRAN (R 3.1.2)
## knitr 1.9 2015-01-20 CRAN (R 3.1.2)
## lattice 0.20-29 2014-04-04 CRAN (R 3.1.2)
## lme4 * 1.1-7 2014-07-19 CRAN (R 3.1.2)
## magrittr * 1.5 2014-11-22 CRAN (R 3.1.2)
## MASS 7.3-35 2014-09-30 CRAN (R 3.1.2)
## Matrix 1.1-4 2014-06-15 CRAN (R 3.1.2)
## mg14 0.0.4 2015-10-27 Github (mg14/mg14@ae963a1)
## mgcv * 1.8-3 2014-08-29 CRAN (R 3.1.2)
## minqa * 1.2.4 2014-10-09 CRAN (R 3.1.2)
## msSurv 1.2-1 2015-03-02 CRAN (R 3.1.2)
## nlme * 3.1-118 2014-10-07 CRAN (R 3.1.2)
## nloptr * 1.0.4 2014-08-04 CRAN (R 3.1.2)
## nnet * 7.3-8 2014-03-28 CRAN (R 3.1.2)
## pbkrtest * 0.4-2 2014-11-13 CRAN (R 3.1.2)
## quantreg * 5.11 2015-01-11 CRAN (R 3.1.2)
## randomForestSRC 1.6.0 2015-01-12 CRAN (R 3.1.2)
## RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.1.2)
## Rcpp 0.11.4 2015-01-24 CRAN (R 3.1.2)
## rJava 0.9-6 2013-12-24 CRAN (R 3.1.2)
## RJSONIO * 1.3-0 2014-07-28 CRAN (R 3.1.2)
## rmarkdown * 0.5.1 2015-01-26 CRAN (R 3.1.2)
## rpart 4.1-8 2014-03-28 CRAN (R 3.1.2)
## rstudioapi * 0.2 2014-12-31 CRAN (R 3.1.2)
## SparseM * 1.6 2015-01-05 CRAN (R 3.1.2)
## stringr * 0.6.2 2012-12-06 CRAN (R 3.1.2)
## survAUC 1.0-5 2012-09-04 CRAN (R 3.1.2)
## survival 2.37-7 2014-01-22 CRAN (R 3.1.2)
## survivalROC 1.0.3 2013-01-13 CRAN (R 3.1.2)
## xlsx 0.5.7 2014-08-02 CRAN (R 3.1.2)
## xlsxjars 0.6.1 2014-08-22 CRAN (R 3.1.2)
## yaml * 2.1.13 2014-06-12 CRAN (R 3.1.2)
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-unknown-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid parallel splines stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] devtools_1.7.0 abind_1.4-0 survivalROC_1.0.3 survAUC_1.0-5 randomForestSRC_1.6.0
## [6] rpart_4.1-8 Rcpp_0.11.4 msSurv_1.2-1 graph_1.44.1 HilbertVis_1.24.0
## [11] lattice_0.20-29 xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-6 htmlwidgets_0.3.2
## [16] DT_0.0.21 mg14_0.0.4 CoxHD_0.0.56 RColorBrewer_1.1-2 MASS_7.3-35
## [21] survival_2.37-7 glmnet_1.9-8 Matrix_1.1-4 knitr_1.9
##
## loaded via a namespace (and not attached):
## [1] BiocGenerics_0.12.1 car_2.0-25 class_7.3-11 digest_0.6.8 evaluate_0.5.5
## [6] formatR_1.0 htmltools_0.2.6 KernSmooth_2.23-13 lme4_1.1-7 magrittr_1.5
## [11] mgcv_1.8-3 minqa_1.2.4 nlme_3.1-118 nloptr_1.0.4 nnet_7.3-8
## [16] pbkrtest_0.4-2 quantreg_5.11 RJSONIO_1.3-0 rmarkdown_0.5.1 rstudioapi_0.2
## [21] SparseM_1.6 stats4_3.1.2 stringr_0.6.2 tools_3.1.2 yaml_2.1.13
Allaire, JJ, Joe Cheng, Yihui Xie, Jonathan McPherson, Winston Chang, Jeff Allen, Hadley Wickham, and Rob Hyndman. 2015. Rmarkdown: Dynamic Documents for R. http://CRAN.R-project.org/package=rmarkdown.
Burnett, Alan K, Anthony Goldstone, Robert K Hills, Donald Milligan, Archie Prentice, John Yin, Keith Wheatley, Ann Hunter, and Nigel Russell. 2013. “Curability of Patients with Acute Myeloid Leukemia Who Did Not Undergo Transplantation in First Remission.” J Clin Oncol 31 (10): 1293–1301. doi:10.1200/JCO.2011.40.5977.
Cancer Genome Atlas Research Network. 2013. “Genomic and Epigenomic Landscapes of Adult de Novo Acute Myeloid Leukemia.” N Engl J Med 368 (22): 2059–74. doi:10.1056/NEJMoa1301689.
Döhner, Hartmut, Elihu H Estey, Sergio Amadori, Frederick R Appelbaum, Thomas Büchner, Alan K Burnett, Hervé Dombret, et al. 2010. “Diagnosis and Management of Acute Myeloid Leukemia in Adults: Recommendations from an International Expert Panel, on Behalf of the European LeukemiaNet.” Blood 115 (3): 453–74. doi:10.1182/blood-2009-07-235358.
Eddelbuettel, Dirk, and Romain Francois. 2011. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8): 1–18. http://www.jstatsoft.org/v40/i08.
Ferguson, Nicole, Somnath Datta, and Guy Brock. 2012. “MsSurv: An R Package for Nonparametric Estimation of Multistate Models.” Journal of Statistical Software 50 (14): 1–24. http://www.jstatsoft.org/v50/i14.
Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” J Stat Softw 33 (1): 1–22.
Gerds, Thomas A, and Martin Schumacher. 2006. “Consistent Estimation of the Expected Brier Score in General Survival Models with Right-Censored Event Times.” Biom J 48 (6): 1029–40.
Gönen, Mithat, and Glenn Heller. 2005. “Concordance Probability and Discriminatory Power in Proportional Hazards Regression.” Biometrika 92 (4): 965–70. doi:10.1093/biomet/92.4.965.
Gray, Robert J. 1992. “Flexible Methods for Analyzing Survival Data Using Splines, with Applications to Breast Cancer Prognosis.” Journal of the American Statistical Association 87 (420): 942–51. doi:10.1080/01621459.1992.10476248.
Harrell, F E, Jr, K L Lee, and D B Mark. 1996. “Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors.” Stat Med 15 (4): 361–87. doi:10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4.
Ishwaran, Hemant, Udaya B Kogalur, Eugene H Blackstone, and Michael S Lauer. 2008. “Random Survival Forests.” The Annals of Applied Statistics 2 (3). JSTOR: 841–60. doi:10.1214/08-AOAS169.
Martincorena, Iñigo, and others. 2015. “Manuscript in Preparation.”
Meinshausen, Nicolai, and Peter Bühlmann. 2010. “Stability Selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (4). Blackwell Publishing Ltd: 417–73. doi:10.1111/j.1467-9868.2010.00740.x.
Nagelkerke, N. J. D. 1991. “A Note on a General Definition of the Coefficient of Determination.” Biometrika 78 (3): 691–92. doi:10.1093/biomet/78.3.691.
O’Quigley, John, Ronghui Xu, and Janez Stare. 2005. “Explained Randomness in Proportional Hazards Models.” Stat Med 24 (3): 479–89. doi:10.1002/sim.1946.
Papaemmanuil, Elli, Moritz Gerstung, Luca Malcovati, Sudhir Tauro, Gunes Gundem, Peter Van Loo, Chris J Yoon, et al. 2013. “Clinical and Biological Implications of Driver Mutations in Myelodysplastic Syndromes.” Blood 122 (22): 3616–27. doi:10.1182/blood-2013-08-518886.
Papaemmanuil, Elli, Moritz Gerstung, Richard Schlenk, and others. 2015. Submitted Manuscript.
Pastore, Friederike, Annika Dufour, Tobias Benthaus, Klaus H Metzeler, Kati S Maharry, Stephanie Schneider, Bianka Ksienzyk, et al. 2014. “Combined Molecular and Clinical Prognostic Index for Relapse and Survival in Cytogenetically Normal Acute Myeloid Leukemia.” J Clin Oncol, Apr. doi:10.1200/JCO.2013.52.3480.
Perperoglou, Aris. 2014. “Cox Models with Dynamic Ridge Penalties on Time-Varying Effects of the Covariates.” Stat Med 33 (1): 170–80. doi:10.1002/sim.5921.
Potapov, Sergej, Werner Adler, and Matthias Schmid. 2012. SurvAUC: Estimators of Prediction Accuracy for Time-to-Event Data. http://CRAN.R-project.org/package=survAUC.
Schall, Robert. 1991. “Estimation in Generalized Linear Models with Random Effects.” Biometrika 78 (4): 719–27. doi:10.1093/biomet/78.4.719.
Schmoor, C, W Sauerbrei, and M Schumacher. 2000. “Sample Size Considerations for the Evaluation of Prognostic Factors in Survival Analysis.” Stat Med 19 (4): 441–52.
Shah, Rajen D., and Richard J. Samworth. 2013. “Variable Selection with Error Control: Another Look at Stability Selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75 (1). Blackwell Publishing Ltd: 55–80. doi:10.1111/j.1467-9868.2011.01034.x.
Simon, Noah, Jerome H. Friedman, Trevor Hastie, and Rob Tibshirani. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software 39 (5): 1–13. http://www.jstatsoft.org/v39/i05.
Therneau, Terry. 2012. Coxme: Mixed Effects Cox Models. http://CRAN.R-project.org/package=coxme.
———. 2014. A Package for Survival Analysis in S.
Therneau, Terry M, Patricia M Grambsch, and V. Shane Pankratz. 2003. “Penalized Survival Models and Frailty.” Journal of Computational and Graphical Statistics 12 (1): 156–75. doi:10.1198/1061860031365.
Therneau, Terry, Beth Atkinson, and Brian Ripley. 2014. Rpart: Recursive Partitioning and Regression Trees. http://CRAN.R-project.org/package=rpart.
Uno, Hajime, Tianxi Cai, Lu Tian, and L. J Wei. 2007. “Evaluating Prediction Rules for T-Year Survivors with Censored Regression Models.” Journal of the American Statistical Association 102 (478): 527–37. doi:10.1198/016214507000000149.
van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “Mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. http://www.jstatsoft.org/v45/i03/.
Wood, Simon N. 2013. “A Simple Test for Random Effects in Regression Models.” Biometrika 100 (4): 1005–10. doi:10.1093/biomet/ast038.
Xie, Yihui. 2015. Knitr: A General-Purpose Package for Dynamic Report Generation in R. http://CRAN.R-project.org/package=knitr.
3.3.6 Comments
In the absence of an established estimator of the joint density Eq.\(\eqref{eq:joint-density}\), we assumed that each factor of the density can be estimated separately using a random effects model. We note that the interdependence of the observed events could, in general, introduce a bias, because the censoring is not independent; the precise magnitude of this effect remains to be investigated.
We observed, however, good agreement between the averaged predictions and the static multistage probabilities, indicating that these biases tend to cancel on average. Moreover, cross-validation of our methodology confirmed very good predictive performance despite these potential shortcomings.
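Schematically, for a remission–relapse–death sequence, this separate-estimation assumption amounts to (our shorthand; the precise form is given by Eq.\(\eqref{eq:joint-density}\)):

\[
P(t_{\mathrm{CR}}, t_{\mathrm{rel}}, t_{\mathrm{death}}) \approx P(t_{\mathrm{CR}})\,P(t_{\mathrm{rel}} \mid t_{\mathrm{CR}})\,P(t_{\mathrm{death}} \mid t_{\mathrm{rel}}),
\]

with each conditional factor fitted as a separate random-effects Cox model.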